I don't think that this would be considered a bug. The
reason for the discrepancy between use="complete.obs"
and use="pairwise.complete.obs" for the case of the
Spearman correlation of two vectors x, y is this:
"pairwise" does complete.cases(x,y) and then ranks;
this is also what's done in cor.test().
"complete" ranks first (keeping NAs via the
na.last="keep" argument to rank()) and then does
complete.cases(ranked.x,ranked.y) on the ranked data.
This can obviously lead to a different set of
ranks being correlated than those for "pairwise".
I must admit that I wasn't aware that R does this
and I don't know the rationale for it. The help page
says:
If use is "complete.obs" then missing values are
handled by casewise deletion ...
which is not clear on the order of ranking and
deletion, but further down the page:
Note that "spearman" basically computes cor(R(x), R(y))
(or cov(.,.)) where R(u) := rank(u, na.last="keep").
In the case of missing values, the ranks are calculated
depending on the value of use, either based on complete
observations, or based on pairwise completeness with
reranking for each pair.
I guess that this implies that, for "complete", the ranking
occurs before the casewise deletion (else why the
na.last="keep"?).
If anyone knows the rationale and/or can give a reference,
I'd be glad to receive such.
-Peter Ehlers
On 2010-06-09 11:36, hugh.ge...@thomsonreuters.com wrote:
Arrrrr,
I think I've found a bug in the behavior of the stats::cor function when
NAs are present, but in case I'm missing something, could you look over
this example and let me know what you think:
a = c(1,3,NA,1,2)
b = c(1,2,1,1,4)
cor(a,b,method="spearman", use="complete.obs")
[1] 0.8164966
cor(a,b,method="spearman", use="pairwise.complete.obs")
[1] 0.7777778
My understanding is that, when the inputs are vectors (but not
necessarily when they're matrices), the "complete.obs" and
"pairwise.complete.obs" arguments should give identical spearman
correlations. The above example clearly shows they do not in my version
of R (2.11.1). However, in cor.test, they do:
cor.test(a,b,method="spearman", use="complete.obs")
Spearman's rank correlation rho
data: a and b
S = 2.2222, p-value = 0.2222
alternative hypothesis: true rho is not equal to 0
sample estimates:
rho
0.7777778
So cor and cor.test do not agree, which seems very likely to be a bug.
When calculating by hand, I also get 0.7777778. Additionally, when
using an old version of R (2.5.0), both the complete.obs and
pairwise.complete.obs versions give 0.7777778. Which strongly suggests
either 2.5.0 or 2.11.1 has a bug in it. Is this a bug? If so, has it
already been reported? (I found a related but confusing email thread
from 2004 in the R archives, but I did not find the resolution to that
bug report).
Additional info:
Platform = Windows XP
sessionInfo()
R version 2.11.1 (2010-05-31)
i386-pc-mingw32
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United
States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] tools_2.11.1
Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Thanks,
--Hugh
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.