Hi

r-help-boun...@r-project.org wrote on 08.06.2010 08:44:39:

> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
> > # this works as expected
> > ex=sample(1:20, replace=TRUE)
> > ex
> > duplicated(ex)
> > ex=sort(ex)

This is OK, as sort() sorts your data.

> > ex
> > duplicated(ex)
>
> > # but why does duplicated() not work after order()?
> > ex=sample(1:20, replace=TRUE)
> > ex
> > duplicated(ex)
> > ex=order(ex)

This is not OK, as order() gives you positions, not your data.

> ex=sample(letters[1:5],20, replace=TRUE)
> ex
 [1] "b" "b" "b" "e" "d" "c" "e" "a" "a" "d" "d" "d" "a" "e" "b" "c" "e" "d" "a"
[20] "a"
> ex<-order(ex)
> ex
 [1]  8  9 13 19 20  1  2  3 15  6 16  5 10 11 12 18  4  7 14 17

ex=ex[order(ex)]

should give you the same result as sort(), maybe with the exception of ties.

> > duplicated(ex)
>
> Why does duplicated() not work after order() has been applied but it works
> fine after sort()? Is this an error or is there something I don't
> understand?
>
> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real life problem
>
> > str(Masechaba$PROPDESC)
> Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
> > # Create an indicator if the PROPDESC is unique. Default false
> > Masechaba$unique=FALSE
> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> > # Check if something happened
> > length(which(Masechaba$unique==TRUE))
> [1] 2174
> > length(which(Masechaba$unique==FALSE))
> [1] 476
> > Masechaba$duplicate=FALSE
> > Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> > length(which(Masechaba$duplicate==TRUE))
> [1] 476
> > length(which(Masechaba$duplicate==FALSE))
> [1] 2174
> > # Looks OK so far
> > # Test on a known duplicate. I expect one to be true and one to be false
> > Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
> > # This is strange. I expected that unique() and duplicate() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.

No.

ex=sample(letters[1:5],10, replace=TRUE)
ex
 [1] "b" "d" "d" "b" "a" "c" "b" "c" "d" "d"
unique(ex)
[1] "b" "d" "a" "c"
duplicated(ex)
 [1] FALSE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE

The functions give you different answers because you are asking different
questions of your data: unique() returns the distinct values, while
duplicated() returns one logical value per element.

> > Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This seems strange. At first sight I am puzzled about what result I should
expect from such a construction.

Regards
Petr

> # The totals are the same but not the individual results
> > table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
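
A small sketch may make both points concrete. It reuses the ten-letter vector from the example in this reply as a stand-in for Masechaba$PROPDESC and assumes there are no NA values. The names pos, flag_wrong, first_seen and has_copy are only illustrative and do not come from the original code.

```r
# Toy vector taken from the example in this reply; it stands in for
# Masechaba$PROPDESC (assumption: no NA values).
ex <- c("b", "d", "d", "b", "a", "c", "b", "c", "d", "d")

# order() returns positions (a permutation of 1..n), so the result
# itself can never contain duplicates.
pos <- order(ex)
any(duplicated(pos))           # FALSE
identical(ex[pos], sort(ex))   # TRUE: indexing by the positions sorts the data

# The construction from the question: unique(ex) has only 4 elements,
# so which() returns 1:4 and the first 4 rows are flagged, whatever
# values those rows actually hold.
flag_wrong <- rep(FALSE, length(ex))
flag_wrong[which(is.na(unique(ex)) == FALSE)] <- TRUE
flag_wrong                     # TRUE for positions 1 to 4, FALSE afterwards

# Element-wise flags that line up with the rows:
first_seen <- !duplicated(ex)  # TRUE at the first occurrence of each value
has_copy   <- duplicated(ex) | duplicated(ex, fromLast = TRUE)
                               # TRUE for every value occurring more than once
```

If the same thing happens in the real data, it would explain the table in the question: the number of TRUE flags equals the number of distinct values, so the totals match those from duplicated(), but the flags sit on the first rows of the data frame rather than on the rows they are meant to describe.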
>
> Best regards
> Christaan
>
>
> P.S
> > sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines   stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40  Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
>
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1    tools_2.11.1
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.