Michael Dewey wrote:
> At 05:07 30/03/2009, Aaron M. Swoboda wrote:
>> I would like to know which rows are duplicates of each other, not
>> simply that a row is duplicate of another row. In the following
>> example rows 1 and 3 are duplicates.
>>
>> > x <- c(1,3,1)
>> > y <- c(2,4,2)
>> > z <- c(3,4,3)
>> > data <- data.frame(x,y,z)
>>   x y z
>> 1 1 2 3
>> 2 3 4 4
>> 3 1 2 3
>

i don't have any solution significantly better than what you have
already been given.  but i have a warning instead.

in the below, you use both 'duplicated' and 'unique' on data frames, and
the proposed solution relies on the latter.  you may want to try to
avoid both when working with data frames;  this is because of how they
do (or don't) work.

duplicated (and unique, which calls duplicated) simply pastes the
content of each row into a *string*, and then works on the strings.
this means that NAs in the data frame are converted to "NA"s, and
"NA" == "NA", obviously, so that rows that include NAs and are otherwise
identical will be considered *identical*.

that's not bad (yet), but you should be aware.  however, duplicated has
a parameter named 'incomparables', explained in ?duplicated as follows:

"
incomparables: a vector of values that cannot be compared. 'FALSE' is a
          special value, meaning that all values can be compared, and
          may be the only value accepted for methods other than the
          default.  It will be coerced internally to the same type as
          'x'.
"

and also

"
     Values in 'incomparables' will never be marked as duplicated. This
     is intended to be used for a fairly small set of values and will
     not be efficient for a very large set.
"

that is, for example:

    vector = c(NA, NA)
    duplicated(vector)
    # [1] FALSE TRUE
    duplicated(vector, incomparables=NA)
    # [1] FALSE FALSE

    list = list(NA, NA)
    duplicated(list)
    # [1] FALSE TRUE
    duplicated(list, incomparables=NA)
    # [1] FALSE FALSE

what the documentation *fails* to tell you is that the parameter
'incomparables' is defunct in duplicated.data.frame, which you can see
in its source code (below), or in the following example:

    # data as above, or any data frame
    duplicated(data, incomparables=NA)
    # Error in if (!is.logical(incomparables) || incomparables) .NotYetUsed("incomparables != FALSE") :
    #   missing value where TRUE/FALSE needed

the error message here is *confusing*.

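both points above can be checked by hand.  the following is only a minimal sketch, not the actual base R implementation: the data frame 'df', the names 'keys', 'first', 'cond' and 'failed', and the "\r" separator are all made up for illustration, and the last part merely evaluates the condition quoted in the error message, so the data frame method itself may behave differently in other R versions.

```r
# made-up data: rows 2 and 4 contain NA and are otherwise identical
df <- data.frame(x = c(1, NA, 1, NA), y = c(2, 4, 2, 4))

# rows with NAs are flagged as duplicates just like any other rows
dup <- duplicated(df)
dup

# hand-made row keys, in the spirit of the pasting described above;
# the separator is an arbitrary choice for this sketch
keys <- do.call("paste", c(df, sep = "\r"))

# for every row, the index of the first row that carries the same key
first <- match(keys, keys)
first

# the condition quoted in the error message, evaluated by hand
incomparables <- NA
cond <- !is.logical(incomparables) || incomparables
cond  # NA, i.e. neither TRUE nor FALSE

# if() cannot branch on NA, so it signals an error instead
failed <- inherits(
    tryCatch(if (cond) "yes" else "no", error = function(e) e),
    "error")
failed
```

as a side effect, match(keys, keys) also says *which* earlier row each row coincides with, though it inherits the same treatment of NAs as the pasting approach.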
the error is raised because the author of the code made a mistake and
apparently hasn't carefully examined and tested his product;  the code
goes:

    duplicated.data.frame
    # function (x, incomparables = FALSE, fromLast = FALSE, ...)
    # {
    #     if (!is.logical(incomparables) || incomparables)
    #         .NotYetUsed("incomparables != FALSE")
    #     duplicated(do.call("paste", c(x, sep = "\r")), fromLast = fromLast)
    # }
    # <environment: namespace:base>

clearly, the intention here is to raise an error with a (still hardly
clear) message as in:

    .NotYetUsed("incomparables != FALSE")
    # Error: argument 'incomparables != FALSE' is not used (yet)

but instead, if(NA) is evaluated (because '!is.logical(NA) || NA'
evaluates, *obviously*, to NA) and hence the uninformative error message.

take home point:  rtfm, *but* don't believe it.

vQ

> Does this do what you want?
>
> > x <- c(1,3,1)
> > y <- c(2,4,2)
> > z <- c(3,4,3)
> > data <- data.frame(x,y,z)
> > data.u <- unique(data)
> > data.u
>   x y z
> 1 1 2 3
> 2 3 4 4
> > data.u <- cbind(data.u, set = 1:nrow(data.u))
> > merge(data, data.u)
>   x y z set
> 1 1 2 3   1
> 2 1 2 3   1
> 3 3 4 4   2
>
> You need to do a bit more work to get them back into the original row
> order if that is essential.
>
>
>
>> I can't figure out how to get R to tell me that observation 1 and 3
>> are the same. It seems like the "duplicated" and "unique" functions
>> should be able to help me out, but I am stumped.
>>
>> For instance, if I use "duplicated" ...
>>
>> > duplicated(data)
>> [1] FALSE FALSE TRUE
>>
>> it tells me that row 3 is a duplicate, but not which row it matches.
>> How do I figure out WHICH row it matches?
>>
>> And If I use "unique"...
>>
>> > unique(data)
>>   x y z
>> 1 1 2 3
>> 2 3 4 4
>>
>> I see that rows 1 and 2 are unique, leaving me to infer that row 3 was
>> a duplicate, but again it doesn't tell me which row it was a duplicate
>> of (as far as I can tell). Am I missing something?
>>
>> How can I determine that row 3 is a duplicate OF ROW 1?
>>
>> Thanks,
>>
>> Aaron
>>
>>
>
> Michael Dewey
> http://www.aghmed.fsnet.co.uk
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
-------------------------------------------------------------------------------
Wacek Kusnierczyk, MD PhD

Email: w...@idi.ntnu.no
Phone: +47 73591875, +47 72574609

Department of Computer and Information Science (IDI)
Faculty of Information Technology, Mathematics and Electrical Engineering (IME)
Norwegian University of Science and Technology (NTNU)
Sem Saelands vei 7, 7491 Trondheim, Norway
Room itv303

Bioinformatics & Gene Regulation Group
Department of Cancer Research and Molecular Medicine (IKM)
Faculty of Medicine (DMF)
Norwegian University of Science and Technology (NTNU)
Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway
Room 231.05.060