Hi everybody

I have found something that seems strange (to me at least) about
duplicated(). I will first give a reproducible example of the behaviour I
find odd, and then a sample of unexpected results from my own data. I hope
someone can help me understand this.

Consider the following

# this works as expected

ex=sample(1:20, replace=TRUE)

ex

duplicated(ex)

ex=sort(ex)

ex

duplicated(ex)


# but why does duplicated() not work after order()?

ex=sample(1:20, replace=TRUE)

ex

duplicated(ex)

ex=order(ex)

duplicated(ex)

Why does duplicated() not work after order() has been applied, when it works
fine after sort()? Is this an error, or is there something I don't
understand?
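
A small sketch that might help narrow this down. It assumes, as I read the help page for order(), that order() returns the positions that would sort the vector rather than the sorted values themselves. The vector x below is a made-up example (not from my data), fixed so the result does not depend on sample().

```r
# Hypothetical fixed vector with known repeats
x <- c(5, 3, 5, 1, 3)

sorted_x <- sort(x)    # the values themselves, rearranged
idx      <- order(x)   # the positions of the values in sorted order

# sort() keeps the repeated values, so duplicated() can flag them
duplicated(sorted_x)

# order() gives each position 1..length(x) exactly once,
# so there is nothing left for duplicated() to flag
duplicated(idx)

# Indexing with the permutation recovers the sorted values
identical(x[idx], sorted_x)
```

If that reading is right, duplicated(order(ex)) would be all FALSE for any input, and duplicated(ex[order(ex)]) would be the call that matches the sort() case.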

I have been getting very strange results from duplicated() and unique() in a
dataset I am analysing. Here is a little sample of my real-life problem:

> str(Masechaba$PROPDESC)
 Factor w/ 24545 levels "     06","   71Hemilton str",..: 14527 8043 16113 16054 13875 15780 12522 7771 14824 12314 ...
> # Create an indicator of whether PROPDESC is unique. Default FALSE
> Masechaba$unique=FALSE
> Masechaba$unique[which(is.na(unique(Masechaba$PROPDESC))==FALSE)]=TRUE
> # Check if something happened
> length(which(Masechaba$unique==TRUE))
[1] 2174
> length(which(Masechaba$unique==FALSE))
[1] 476
> Masechaba$duplicate=FALSE
> Masechaba$duplicate[which(duplicated(Masechaba$PROPDESC)==TRUE)]=TRUE
> length(which(Masechaba$duplicate==TRUE))
[1] 476
> length(which(Masechaba$duplicate==FALSE))
[1] 2174
> # Looks OK so far
> # Test on a known duplicate. I expect one to be true and one to be false
> Masechaba[which(Masechaba$PROPDESC==2363),10:12]
      PROPDESC unique duplicate
24874     2363   TRUE     FALSE
31280     2363   TRUE      TRUE

# This is strange. I expected that unique() and duplicated() would give the
# same results. The variable PROPDESC is clearly not unique in either case.
# The totals are the same but not the individual results
> table(Masechaba$unique,Masechaba$duplicate)

        FALSE TRUE
  FALSE   342  134
  TRUE   1832  342

I don't understand this. Is there something I am missing?
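
One guess I have not been able to confirm on the real data: unique() returns a shorter vector with one entry per distinct value, so its positions may not line up with the rows of the data frame. The sketch below uses a made-up vector x as a stand-in for PROPDESC; the names flag_by_position, occurs_once and first_occurrence are only for illustration.

```r
# Hypothetical stand-in for the PROPDESC column
x <- c("a", "b", "a", "c", "b")

u <- unique(x)     # one entry per distinct value, in order of first appearance
length(u)          # shorter than x when x contains repeats

# Same pattern as in my code above: which() on the unique values
# yields 1..length(u), whatever the content of x is
flag_by_position <- rep(FALSE, length(x))
flag_by_position[which(!is.na(u))] <- TRUE
flag_by_position

# A possible per-element alternative: values that occur exactly once
occurs_once <- !(duplicated(x) | duplicated(x, fromLast = TRUE))
occurs_once

# The complement of duplicated(): the first occurrence of each value
first_occurrence <- !duplicated(x)
first_occurrence
```

If this is what happens, the TRUE count in my unique column would simply be the number of distinct values, which would explain why the totals agree with duplicated() while the individual rows do not.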

Best regards
Christaan


P.S
> sessionInfo()
R version 2.11.1 (2010-05-31)
x86_64-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] splines   stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_0.1.9      maptools_0.7-34 lattice_0.18-8  foreign_0.8-40  Hmisc_3.8-0     survival_2.35-8 rgdal_0.6-26
[8] sp_0.9-64

loaded via a namespace (and not attached):
[1] cluster_1.12.3 grid_2.11.1    tools_2.11.1


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
