Wacek Kusnierczyk wrote:
Wacek Kusnierczyk wrote:
Michael Dewey wrote:
At 05:07 30/03/2009, Aaron M. Swoboda wrote:
I would like to know which rows are duplicates of each other, not
simply that a row is duplicate of another row. In the following
example rows 1 and 3 are duplicates.
x <- c(1,3,1)
y <- c(2,4,2)
z <- c(3,4,3)
data <- data.frame(x,y,z)
  x y z
1 1 2 3
2 3 4 4
3 1 2 3
i don't have any solution significantly better than what you have
already been given.
i now seem to have one:
# dummy data
data = data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
replace=TRUE))
# add a class column; identical rows have the same class id
data$class = local({
   rows = do.call('paste', c(data, sep='\r'))
   with(
      rle(sort(rows)),
      rep(1:length(values), lengths)[rank(rows)] ) })
data
#   x y class
# 1 2 2     3
# 2 2 1     2
# 3 2 1     2
# 4 1 2     1
# 5 2 2     3
another approach (maybe a bit cleaner) seems to be:
data <- data.frame(x=sample(1:2, 5, replace=TRUE), y=sample(1:2, 5,
replace = TRUE))
vals <- do.call('paste', c(data, sep = '\r'))
data$class <- match(vals, unique(vals))
data
I have tried benchmarking it.
Best,
Dimitris
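For reference, here is a minimal sketch of the match-based idea applied to the three-row example from the original question. The names key and groups are made up for this illustration and are not from the posts above.

```r
# the data from the original question: rows 1 and 3 are identical
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))

# one string key per row, built the same way as in the messages above
key <- do.call('paste', c(data, sep = '\r'))

# class id = position of the first row carrying the same key
data$class <- match(key, unique(key))
data$class
# [1] 1 2 1

# row numbers that fall into each class (hypothetical helper object)
groups <- split(seq_len(nrow(data)), data$class)
groups
```

With this numbering the class ids follow the order of first appearance, so the result stays in the original row order.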
the rle-based approach seems to be faster than michael's merge-based one,
by a factor that depends on the shape (and size?) of the input:
# dummy data frame, just integers
n = 100; m = 100
data = as.data.frame(
matrix(nrow=n, ncol=m,
sample(n, m*n, replace=TRUE)))
# do a simple benchmarking
library(rbenchmark)
benchmark(replications=100, order='elapsed',
   columns=c('test', 'elapsed'),
   waku=local({
      rows = do.call('paste', c(data, sep='\r'))
      data$class = with(
         rle(sort(rows)),
         rep(1:length(values), lengths)[rank(rows)] ) }),
   mide=local({
      unique = unique(data)
      data = merge(data, cbind(unique, class=1:nrow(unique))) }))
# test elapsed
# 1 waku 0.503
# 2 mide 3.269
and for m = 10 and n = 1000 i get:
# test elapsed
# 1 waku 0.571
# 2 mide 15.836
while for m = 1000 and n = 10 i get:
# test elapsed
# 1 waku 1.110
# 2 mide 2.461
the type of the content should not have any impact on the ratio (pure
guess, no testing done).
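The guess about content type could be checked with a small timing sketch along these lines. It is only a sketch under assumptions: it uses character content, base system.time instead of rbenchmark, the hypothetical helper names by_rle and by_match, and ties.method = 'min' so that rank() returns whole-number positions. No timings are claimed here.

```r
set.seed(1)
n <- 1000; m <- 10

# dummy data frame with character content instead of integers
data <- as.data.frame(
    matrix(sample(letters, m * n, replace = TRUE), nrow = n, ncol = m),
    stringsAsFactors = FALSE)

# rle-based class ids (hypothetical wrapper around the code above)
by_rle <- function(d) {
    rows <- do.call('paste', c(d, sep = '\r'))
    with(rle(sort(rows)),
         rep(seq_along(values), lengths)[rank(rows, ties.method = 'min')])
}

# match-based class ids (hypothetical wrapper)
by_match <- function(d) {
    rows <- do.call('paste', c(d, sep = '\r'))
    match(rows, unique(rows))
}

# rough timings; the numbers will vary by machine
system.time(for (i in 1:20) a <- by_rle(data))
system.time(for (i in 1:20) b <- by_match(data))

# the two labelings number the classes differently, but after relabeling
# by first appearance they should describe the same grouping of rows
stopifnot(identical(match(a, unique(a)), match(b, unique(b))))
```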
whether my approach is more intuitive is arguable. note that, unlike in
michael's solution, the final result (the data frame with a class column
added) is in the original order. (and sorting would add a performance
penalty in the other case.)
my previous remarks about the treatment of NAs still apply; the
do.call('paste', ... is taken from duplicated.data.frame.
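One NA-related caveat of a paste-based key can be shown in a few lines (whether this is exactly what the earlier remarks referred to is an assumption): paste() turns a missing value into the string "NA", so a genuinely missing entry and the literal string 'NA' end up with the same key. The objects d, key and cls are made up for this illustration.

```r
# row 1 has a missing value, row 2 has the literal string 'NA'
d <- data.frame(a = c(NA, 'NA', 'b'), b = c(1, 1, 1),
                stringsAsFactors = FALSE)

# paste() converts NA to the string "NA", so rows 1 and 2 get equal keys
key <- do.call('paste', c(d, sep = '\r'))

cls <- match(key, unique(key))
cls
# [1] 1 1 2
```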
regards,
vQ
Does this do what you want?
x <- c(1,3,1)
y <- c(2,4,2)
z <- c(3,4,3)
data <- data.frame(x,y,z)
data.u <- unique(data)
data.u
  x y z
1 1 2 3
2 3 4 4
data.u <- cbind(data.u, set = 1:nrow(data.u))
merge(data, data.u)
  x y z set
1 1 2 3   1
2 1 2 3   1
3 3 4 4   2
You need to do a bit more work to get them back into the original row
order if that is essential.
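One possible way to do that extra work, as a sketch only: record each row's position before the merge and sort on it afterwards. The helper column name row and the objects merged and result are made up for this illustration.

```r
data <- data.frame(x = c(1, 3, 1), y = c(2, 4, 2), z = c(3, 4, 3))
data.u <- unique(data)
data.u <- cbind(data.u, set = 1:nrow(data.u))

# remember the original row position ('row' is a hypothetical helper column)
data$row <- seq_len(nrow(data))

# merge on the shared columns x, y, z; this reorders the rows
merged <- merge(data, data.u)

# restore the original order and drop the helper column
result <- merged[order(merged$row), c('x', 'y', 'z', 'set')]
rownames(result) <- NULL
result
#   x y z set
# 1 1 2 3   1
# 2 3 4 4   2
# 3 1 2 3   1
```

The extra order() call is the sorting cost mentioned elsewhere in the thread.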
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Dimitris Rizopoulos
Assistant Professor
Department of Biostatistics
Erasmus University Medical Center
Address: PO Box 2040, 3000 CA Rotterdam, the Netherlands
Tel: +31/(0)10/7043478
Fax: +31/(0)10/7043014