You can find out which rows of a data.frame called dataFrame are
duplicates of previous rows with

   dups <- duplicated(dataFrame)

To make a new data.frame without them do

   duplessDataFrame <- dataFrame[!dups, ]

You could use unique(dataFrame), but, as in your examples, I think one
often wants to remove duplicates based on only some of the columns.
E.g., with the following data.frame

   dataFrame <- data.frame(Name=LETTERS[1:9],
                           One=rep(1:3,3),
                           Two=c(11,12,13,11,11,12,12,13,13),
                           Three=c(101,102,103,101,101,103,101,102,103))

we get

   > dataFrame
     Name One Two Three
   1    A   1  11   101
   2    B   2  12   102
   3    C   3  13   103
   4    D   1  11   101
   5    E   2  11   101
   6    F   3  12   103
   7    G   1  12   101
   8    H   2  13   102
   9    I   3  13   103
   > duplicated(dataFrame)
   [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
   > dups123 <- duplicated(dataFrame[,c("One","Two","Three")])
   > dups123
   [1] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
   > dataFrame[!dups123, ]
     Name One Two Three
   1    A   1  11   101
   2    B   2  12   102
   3    C   3  13   103
   5    E   2  11   101
   6    F   3  12   103
   7    G   1  12   101
   8    H   2  13   102
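
Applied to a dataset shaped like the one in the question, the same
pattern might look roughly like the sketch below. The small 'detail2'
here is a made-up stand-in with only a few of the columns named in the
question, the NOTE column is hypothetical, and all the values are
invented for illustration. The point is that the key columns are given
by name, as a character vector.

```r
# Toy stand-in for 'detail2'; every value here is invented for illustration.
detail2 <- data.frame(
  TDATE  = c("2012-08-01", "2012-08-01", "2012-08-02", "2012-08-01"),
  FIRM   = c("X", "X", "Y", "X"),
  SHARES = c(100, 100, 200, 100),
  NOTE   = c("a", "b", "c", "d"),   # hypothetical column, not part of the key
  stringsAsFactors = FALSE
)

# Name the key columns as strings; do not pass detail2$TDATE etc.
keyCols <- c("TDATE", "FIRM", "SHARES")

# TRUE for every row whose key combination already appeared in an earlier row
dupKeys <- duplicated(detail2[, keyCols])

# Keep all columns, but only the first row for each key combination
detail3 <- detail2[!dupKeys, ]
```

Because duplicated() flags only the later occurrences, the first row
seen for each combination of key values is the one that survives.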

Your first expression

   detail3 <- [!duplicated(...)]

must have caused a syntax error, as "[" is the subscript operator and
requires something before it, as in detail2[...].

To see why your second attempt

   detail3 <- unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
                                detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
                                detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
                                detail2$STKFUL)])

will not do what you want (even if it did finish in a reasonable amount
of time), break it into pieces and use the example dataset above. You
asked it to extract the columns specified by 'tmp', where 'tmp' was
constructed by:

   > print(tmp <- c(dataFrame$One, dataFrame$Two, dataFrame$Three))
    [1]   1   2   3   1   2   3   1   2   3  11  12
   [12]  13  11  11  12  12  13  13 101 102 103 101
   [23] 101 103 101 102 103

Then dataFrame[, tmp] is asking it to make a 27-column data.frame using
those values as column indices (most of which don't exist in the
original 4-column data.frame). You should have gotten an 'undefined
columns selected' error. Perhaps it ran out of memory while checking
all 184K * 13 columns. That would be odd.

Now if you used the calls I mentioned at first (in the working example)
and R hung, there might be ways to speed up the process.

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of ramoss
> Sent: Wednesday, August 29, 2012 1:58 PM
> To: r-help@r-project.org
> Subject: [R] Deduping in R by multiple variables
>
> I have a dataset w/ 184K obs & 16 variables. In SAS I proc sort nodupkey it
> in seconds by 11 variables.
> I tried to do the same thing in R using both the unique & then the
> !duplicated functions but it just hangs there & I get no output. Does
> anyone know how to solve this?
>
> This is how I tried to do it in R:
>
> detail3 <-
> [!duplicated(c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
> detail2$BEGTIME,
> detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
> detail2$ACCTYP
> ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
> detail2$STKFUL)),]
>
> detail3 <-
> unique(detail2[,c(detail2$TDATE,detail2$FIRM,detail2$CM,detail2$BRANCH,
> detail2$BEGTIME, detail2$ENDTIME,detail2$OTYPE,detail2$OCOND,
> detail2$ACCTYP ,detail2$OSIDE,detail2$SHARES,detail2$STOCKS,
> detail2$STKFUL)])
>
> Thanks in advance
>
> --
> View this message in context:
> http://r.789695.n4.nabble.com/Deduping-in-R-by-multiple-variables-tp4641778.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.