Wacek Kusnierczyk wrote:
> Michael Dewey wrote:
>
>> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>>
>>> I would like to know which rows are duplicates of each other, not
>>> simply that a row is duplicate of another row. In the following
>>> example rows 1 and 3 are duplicates.
>>>
>>>
>>>> x <- c(1,3,1)
>>>> y <- c(2,4,2)
>>>> z <- c(3,4,3)
>>>> data <- data.frame(x,y,z)
>>>>
>>>   x y z
>>> 1 1 2 3
>>> 2 3 4 4
>>> 3 1 2 3
>>>
>
> i don't have any solution significantly better than what you have
> already been given.

i now seem to have one:

    # dummy data
    data = data.frame(
       x=sample(1:2, 5, replace=TRUE),
       y=sample(1:2, 5, replace=TRUE))

    # add a class column; identical rows have the same class id
    data$class = local({
       # one string key per row
       rows = do.call('paste', c(data, sep='\r'))
       # rank() averages tied ranks; indexing truncates the average,
       # which still falls inside the block of ids for that key
       with(
          rle(sort(rows)),
          rep(1:length(values), lengths)[rank(rows)] ) })

    data
    #   x y class
    # 1 2 2     3
    # 2 2 1     2
    # 3 2 1     2
    # 4 1 2     1
    # 5 2 2     3

this approach seems to be roughly comparable to michael's, depending on
the shape (and size?) of the input:

    # dummy data frame, just integers
    n = 100; m = 100
    data = as.data.frame(
       matrix(nrow=n, ncol=m,
          sample(n, m*n, replace=TRUE)))

    # do a simple benchmarking
    library(rbenchmark)
    benchmark(replications=100, order='elapsed', columns=c('test', 'elapsed'),
       waku=local({
          rows = do.call('paste', c(data, sep='\r'))
          data$class = with(
             rle(sort(rows)),
             rep(1:length(values), lengths)[rank(rows)] ) }),
       mide=local({
          unique = unique(data)
          data = merge(data, cbind(unique, class=1:nrow(unique))) }))

    #   test elapsed
    # 1 waku   0.503
    # 2 mide   3.269

and for m = 10 and n = 1000 i get:

    #   test elapsed
    # 1 waku   0.571
    # 2 mide  15.836

while for m = 1000 and n = 10 i get:

    #   test elapsed
    # 1 waku   1.110
    # 2 mide   2.461

the type of the content should not have any impact on the ratio (pure
guess, no testing done).

whether my approach is more intuitive is arguable. note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order. (and sorting would add a performance
penalty in the other case.)

my previous remarks about the treatment of NAs still apply; the
do.call('paste', ... is taken from duplicated.data.frame.

regards,
vQ

>> Does this do what you want?
>>
>>> x <- c(1,3,1)
>>> y <- c(2,4,2)
>>> z <- c(3,4,3)
>>> data <- data.frame(x,y,z)
>>> data.u <- unique(data)
>>> data.u
>>>
>>   x y z
>> 1 1 2 3
>> 2 3 4 4
>>
>>> data.u <- cbind(data.u, set = 1:nrow(data.u))
>>> merge(data, data.u)
>>>
>>   x y z set
>> 1 1 2 3   1
>> 2 1 2 3   1
>> 3 3 4 4   2
>>
>> You need to do a bit more work to get them back into the original row
>> order if that is essential.
>>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
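
For reference, the row-key idea discussed in this thread can be tried on the three-row example from the original question. The sketch below is a variant rather than either poster's code: it builds the same pasted keys, but numbers the classes with match() against the unique keys, which is an assumption made here for brevity. Class ids then follow order of first appearance instead of sort order, and the rows stay in their original order. The names key and groups are illustrative only. Because paste() converts NA to the string "NA", the caveat about NAs mentioned above presumably applies here as well.

```r
# data from the original question; rows 1 and 3 are duplicates
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# one string key per row (same paste trick as in the thread)
key <- do.call(paste, c(data, sep = "\r"))

# class id = index of the row's key among the unique keys
# (assumption: match() is used here instead of the rle/rank step)
data$class <- match(key, unique(key))

# row numbers grouped by class id
groups <- split(seq_len(nrow(data)), data$class)
```

Printing groups lists, for each class id, the rows that are duplicates of each other, which is what the original question asked for.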