On Fri, Apr 08, 2011 at 10:59:10AM -0400, Duncan Murdoch wrote:
> I need a function which is similar to duplicated(), but instead of
> returning TRUE/FALSE, returns indices of which element was duplicated.
> That is,
>
> > x <- c(9,7,9,3,7)
> > duplicated(x)
> [1] FALSE FALSE  TRUE FALSE  TRUE
>
> > duplicates(x)
> [1] NA NA  1 NA  2
>
> (so that I know that element 3 is a duplicate of element 1, and element
> 5 is a duplicate of element 2, whereas the others were not duplicated
> according to our definition.)
>
> Is there a simple way to write this function?

A possible strategy is to use sorting. In a sorted matrix or data frame,
the rows which are duplicates of the same row form consecutive blocks.
These blocks may be identified using !duplicated(), which determines the
first row of each block. Since the sorting done by order() is stable,
when we map these blocks back to the original order, the first row of
each block is mapped to the first occurrence of the given row in the
original order.

An implementation may be done as follows.

  duplicates <- function(dat)
  {
      s <- do.call("order", as.data.frame(dat))
      # drop = FALSE keeps a one-row input two-dimensional
      non.dup <- !duplicated(dat[s, , drop = FALSE])
      orig.ind <- s[non.dup]
      first.occ <- orig.ind[cumsum(non.dup)]
      first.occ[non.dup] <- NA
      first.occ[order(s)]
  }

  x <- cbind(1, c(9,7,9,3,7))
  duplicates(x)
  [1] NA NA  1 NA  2

The line

  orig.ind <- s[non.dup]

creates a vector whose length is the number of non-duplicated rows in
the sorted "dat". Its components are the indices of the first
occurrences of these rows in the original order. For this, the stability
of the ordering is needed.

The lines

  first.occ <- orig.ind[cumsum(non.dup)]
  first.occ[non.dup] <- NA

expand orig.ind to a vector which satisfies: if the i-th row of the
sorted "dat" is duplicated, then first.occ[i] is the index of the first
row in the original "dat" which is equal to this row. So, the values in
first.occ are those required for the output of duplicates(), but they
are in the order of the sorted "dat".

The last line

  first.occ[order(s)]

reorders the vector to the original order of the rows.

Petr Savicky.

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
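
As a check of the explanation above, the intermediate vectors can be traced for the example matrix. This is only a sketch that follows the same steps as the body of duplicates() outside the function; the name "result" is introduced here for illustration, and the values in the comments hold for this particular x.

```r
x <- cbind(1, c(9, 7, 9, 3, 7))

# Sort the rows; order() leaves tied rows in their original relative order.
s <- do.call("order", as.data.frame(x))
s
# [1] 4 2 5 1 3

# TRUE marks the first row of each block of equal rows in the sorted matrix.
non.dup <- !duplicated(x[s, , drop = FALSE])
non.dup
# [1]  TRUE  TRUE FALSE  TRUE FALSE

# Original index of the first occurrence of each distinct row.
orig.ind <- s[non.dup]
orig.ind
# [1] 4 2 1

# cumsum(non.dup) numbers the blocks, so indexing by it repeats each
# first occurrence across its whole block.
first.occ <- orig.ind[cumsum(non.dup)]
first.occ
# [1] 4 2 2 1 1

# The first row of each block is not a duplicate, so it gets NA.
first.occ[non.dup] <- NA
first.occ
# [1] NA NA  2 NA  1

# order(s) is the inverse permutation of s, which restores the row order.
result <- first.occ[order(s)]
result
# [1] NA NA  1 NA  2
```

Rows 3 and 5 of x point back to rows 1 and 2, which agrees with the output requested in the original question.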