On Tuesday, June 8, 2010, christiaan pauw <cjp...@gmail.com> wrote:
> Hi everybody
>
> I have found something (for me at least) strange with duplicated(). I will
> first provide a replicable example of a certain kind of behaviour that I
> find odd and then give a sample of unexpected results from my own data. I
> hope someone can help me understand this.
>
> Consider the following
>
> # this works as expected
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=sort(ex)
> ex
> duplicated(ex)
>
> # but why does duplicate not work after order() ?
> ex=sample(1:20, replace=TRUE)
> ex
> duplicated(ex)
> ex=order(ex)
> duplicated(ex)
>
> Why does duplicated() not work after order() has been applied but it works
> fine after sort() ? Is this an error or is there something I don't
> understand.

The latter: order() returns the indexes into your vector, i.e. a permutation,
which select the values in sorted order. Each element is unique by definition.

> I have been getting very strange results from duplicated() and unique() in a
> dataset I am analysing. Here is a little sample of my real life problem

presumably this is a data.frame...

>> str(Masechaba$PROPDESC)
> Factor w/ 24545 levels " 06"," 71Hemilton str",..: 14527 8043 16113
> 16054 13875 15780 12522 7771 14824 12314 ...
>> # Create an indicator if the PROPDESC is unique. Default false
>> Masechaba$unique=FALSE
>> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE

The statement above is in error. You are indexing the rows of Masechaba with
positions taken from unique(Masechaba$PROPDESC), and those positions do not
correspond to the rows of Masechaba. The two are different lengths. Use
duplicated() instead.

>> # Check if something happened
>> length(which(Masechaba$unique==TRUE))
> [1] 2174
>> length(which(Masechaba$unique==FALSE))
> [1] 476
>> Masechaba$duplicate=FALSE
>> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE

equivalent to
Masechaba$duplicate <- duplicated(Masechaba$PROPDESC)

>> length(which(Masechaba$duplicate==TRUE))
> [1] 476
>> length(which(Masechaba$duplicate==FALSE))
> [1] 2174
>> # Looks OK so far
>> # Test on a known duplicate. I expect one to be true and one to be false
>> Masechaba[which(Masechaba$PROPDESC==2363),10:12]
>       PROPDESC unique duplicate
> 24874     2363   TRUE     FALSE
> 31280     2363   TRUE      TRUE
>
> # This is strange. I expected that unique() and duplicated() would give the
> same results. The variable PROPDESC is clearly not unique in both cases.
> # The totals are the same but not the individual results
>> table(Masechaba$unique,Masechaba$duplicate)
>
>         FALSE TRUE
>   FALSE   342  134
>   TRUE   1832  342
>
> I don't understand this. Is there something I am missing?
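
A minimal sketch of both points on toy data may help. The vector x and the
data frame df below are made up for illustration and only stand in for ex and
Masechaba; the column name first is likewise hypothetical.

```r
# Toy vector with known duplicates (hypothetical data, not from the thread).
x <- c(5, 3, 5, 1, 3)

# sort() returns the values themselves, so duplicates survive sorting.
sort(x)                           # 1 3 3 5 5
any(duplicated(sort(x)))          # TRUE

# order() returns positions: a permutation of seq_along(x), so no repeats.
order(x)                          # 4 2 5 1 3
any(duplicated(order(x)))         # FALSE
identical(x[order(x)], sort(x))   # TRUE: indexing by order() gives sort()

# Toy data frame standing in for Masechaba (hypothetical).
df <- data.frame(PROPDESC = factor(c("a", "b", "a", "c", "b", "a")))

# unique() has one entry per distinct value, so it is shorter than the data.
u <- unique(df$PROPDESC)
length(u)                         # 3, while nrow(df) is 6

# The pattern from the original post: positions within u are used as row
# numbers, so rows 1:3 are flagged regardless of what they contain.
df$unique <- FALSE
df$unique[which(is.na(u) == FALSE)] <- TRUE

# Row-aligned alternative: duplicated() returns one value per row.
df$duplicate <- duplicated(df$PROPDESC)
df$first <- !df$duplicate         # TRUE on the first occurrence of each value

# Every row whose value occurs more than once, first occurrence included.
all_dups <- duplicated(df$PROPDESC) | duplicated(df$PROPDESC, fromLast = TRUE)

table(df$unique, df$duplicate)
```

In this toy case the totals agree for what looks like the same reason as in
the table above: length(unique(x)) equals sum(!duplicated(x)), so the
misaligned flag marks as many rows TRUE as duplicated() marks FALSE, just not
the same rows. If the aim is to flag both rows of a known duplicate, the
fromLast combination in all_dups is one way to do it.
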
>
> Best regards
> Christaan
>
> P.S
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets methods
> base
>
> other attached packages:
> [1] plyr_0.1.9 maptools_0.7-34 lattice_0.18-8 foreign_0.8-40
> Hmisc_3.8-0 survival_2.35-8 rgdal_0.6-26
> [8] sp_0.9-64
>
> loaded via a namespace (and not attached):
> [1] cluster_1.12.3 grid_2.11.1 tools_2.11.1
>
> ______________________________________________
> r-h...@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

--
Felix Andrews / 安福立
Integrated Catchment Assessment and Management (iCAM) Centre
Fenner School of Environment and Society [Bldg 48a]
The Australian National University
Canberra ACT 0200 Australia
M: +61 410 400 963
T: + 61 2 6125 4670
E: felix.andr...@anu.edu.au
CRICOS Provider No. 00120C
--
http://www.neurofractal.org/felix/