Re: [R] which rows are duplicates?

Dimitris Rizopoulos Tue, 31 Mar 2009 05:31:03 -0700

Wacek Kusnierczyk wrote:

Wacek Kusnierczyk wrote:

Michael Dewey wrote:

At 05:07 30/03/2009, Aaron M. Swoboda wrote:

I would like to know which rows are duplicates of each other, not
simply that a row is duplicate of another row. In the following
example rows 1 and 3 are duplicates.

x <- c(1,3,1)
y <- c(2,4,2)
z <- c(3,4,3)
data <- data.frame(x,y,z)

i don't have any solution significantly better than what you have
already been given.


i now seem to have one:

    # dummy data
    data = data.frame(
        x=sample(1:2, 5, replace=TRUE),
        y=sample(1:2, 5, replace=TRUE))

    # add a class column; identical rows have the same class id

    data$class = local({
        rows = do.call('paste', c(data, sep='\r'))
        with(
            rle(sort(rows)),
            rep(1:length(values), lengths)[rank(rows)] ) })

    data
    #   x y class
    # 1 2 2     3
    # 2 2 1     2
    # 3 2 1     2
    # 4 1 2     1
    # 5 2 2     3


another approach (maybe a bit cleaner) seems to be:

data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,replace = TRUE))


vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data
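
Applied to the data from the original question, the class ids can be turned into explicit groups of row numbers, which is closer to "which rows are duplicates of each other". This is only a sketch; the names `key`, `class_id`, `groups` and `dups` are illustrative and do not come from the thread.

```r
# Data from the original question: rows 1 and 3 are identical.
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# One string key per row; '\r' serves as a separator that is unlikely
# to occur in the data itself.
key <- do.call(paste, c(data, sep = "\r"))

# Class id = position of the row's key among the unique keys.
class_id <- match(key, unique(key))

# Row numbers grouped by class; keep only classes with more than one row.
groups <- split(seq_len(nrow(data)), class_id)
dups <- groups[lengths(groups) > 1]
dups
```

Here `dups` should hold a single group containing rows 1 and 3. Note that `lengths()` assumes R 3.2.0 or later.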


I have tried benchmarking it.

Best,
Dimitris

this approach seems to be considerably faster than michael's, by a margin
that depends on the shape (and size?) of the input:

    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
        matrix(nrow=n, ncol=m,
            sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(replications=100, order='elapsed',
        columns=c('test', 'elapsed'),
        waku=local({
            rows = do.call('paste', c(data, sep='\r'))
            data$class = with(
                rle(sort(rows)),
                rep(1:length(values), lengths)[rank(rows)] ) }),
        mide=local({
            unique = unique(data)
            data = merge(data, cbind(unique, class=1:nrow(unique))) }))

    #   test elapsed
    # 1 waku   0.503
    # 2 mide   3.269

and for m = 10 and n = 1000 i get:

    #   test elapsed
    # 1 waku   0.571
    # 2 mide  15.836

while for m = 1000 and n = 10 i get:

    #   test elapsed
    # 1 waku   1.110
    # 2 mide   2.461

the type of the content should not have any impact on the ratio (pure
guess, no testing done).

whether my approach is more intuitive is arguable.  note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order.  (and sorting would add a performance
penalty in the other case.)

my previous remarks about the treatment of NAs still apply;  the
do.call('paste', ... is taken from duplicated.data.frame.
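
One effect of building row keys with paste can be checked directly: paste turns a missing value into the string "NA", so a row with a genuine NA and a row with the literal text "NA" get the same key. The small data frame `d` below is made up for illustration, and this may not be the only NA issue meant above.

```r
# Two rows that differ: a missing value versus the literal string "NA".
d <- data.frame(a = c(NA, "NA"), b = c(1, 1), stringsAsFactors = FALSE)

# Same key construction as above.
key <- do.call(paste, c(d, sep = "\r"))

key[1] == key[2]   # the two keys coincide
duplicated(key)    # so row 2 is flagged as a duplicate of row 1
```

If that distinction matters for the data at hand, the NAs would need to be recoded to some unambiguous marker before pasting.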

regards,
vQ

Does this do what you want?

x <- c(1,3,1)
y <- c(2,4,2)
z <- c(3,4,3)
data <- data.frame(x,y,z)
data.u <- unique(data)
data.u

  x y z
1 1 2 3
2 3 4 4

data.u <- cbind(data.u, set = 1:nrow(data.u))
merge(data, data.u)

  x y z set
1 1 2 3   1
2 1 2 3   1
3 3 4 4   2

You need to do a bit more work to get them back into the original row
order if that is essential.
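
A minimal sketch of that extra work, assuming it is acceptable to carry a temporary index column through the merge (the column name `row` and the object `merged` are hypothetical, not part of the suggestion above):

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))
data.u <- unique(data)
data.u <- cbind(data.u, set = seq_len(nrow(data.u)))

# Remember the original position of each row before merging.
data$row <- seq_len(nrow(data))

# merge() joins on the shared columns x, y, z and sorts by them.
merged <- merge(data, data.u)

# Reorder by the saved index to get back the original row order.
merged <- merged[order(merged$row), ]
merged$set
```

The `row` column can be dropped afterwards with `merged$row <- NULL`.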


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center

Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014
