Hi All: The p-value of Spearman's rank correlation test is calculated using algorithm AS 89. However, the way how AS 89 is used incures error, which may be an order of magnitude larger than the error of the original algorithm.
In case of no ties AS 89 expects an even value of the statistic S = sum((rank(x)-rank(y))^2) Clearly, S is even, since there is an even number of odd differences. We will also need the fact that S ranges between 0 and 2*n*(n^2 - 1)/6, where n = length(x) = length(y). In some situations, R calls AS 89 with an odd value of S. Namely, if S < n*(n^2 - 1)/6 (this corresponds to a positive correlation), then the R call of AS 89 uses as input for computing the lower tail the odd value S+1 instead of the correct S+2 (see the explanation below why this makes a difference). For values S > n*(n^2 - 1)/6 (negative correlation), the R call uses the upper tail of the distribution with input S, which is correct. The fact that in case of no ties AS 89 expects an even S may be seen in the C code. The file R-patched/src/library/stats/src/prho.c, lines 90 - 93 contains the following comment ("is" is the parametr containing the value of S) /* NOT rounding to even anymore: with ties, S, may even be non-integer! * js = *is; * if(fmod(js, 2.) != 0.) ++js; */ Older versions of the algorithm (the original fortran code is available at http://lib.stat.cmu.edu/apstat/89) performed a correction by increasing a possible odd value of "is" by one to an even number. Since now the correction is removed, the calling program should guarantee an even value of S in case of no ties. The original AS 89 calculates only the upper tail. R uses a modification, where the lower tail is computed in the C code as the complement of the upper tail. Hence, in order to get P(S <= s), we can call AS 89 with the input s+2 and ask for the complement of the upper tail, since P(S <= s) = P(S < s+2) = 1 - P(S >= s+2), where S denotes the random variable and s is an even number. For the true distribution, we could equivalently use P(S <= s) = P(S < s+1) = 1 - P(S >= s+1), since P(S >= s+1) = P(S >= s+2). However, the approximation is a continuous function and its value in s+1 is roughly a linear interpolation between its values in s and s+2, so it is usually different from its value in s+2. Since the approximation is done under the assumption that the input is even, the correct approximation is P(S >= s+2). The following table contains the maximum absolute error of the original algorithm and of the R implementation for n = 12, ... ,22. For each n, the maximum error is computed over those values of S which satisfy S < n*(n^2 - 1)/6 (positive correlation) and for which the correct p-value of the two-sided test is in the range [0.01, 0.1]. n err.orig err.R err.R/err.orig 12 3.83e-04 2.75e-03 7.18 13 2.24e-04 2.10e-03 9.39 14 1.43e-04 1.75e-03 12.24 15 9.52e-05 1.46e-03 15.32 16 6.85e-05 1.26e-03 18.45 17 5.18e-05 1.08e-03 20.89 18 4.07e-05 9.40e-04 23.09 19 3.28e-05 8.11e-04 24.73 20 2.66e-05 7.11e-04 26.73 21 2.16e-05 6.31e-04 29.22 22 1.77e-05 5.62e-04 31.77 On the side of negative correlation (S > n*(n^2 - 1)/6), there is no difference between the original algorithm and its R implementation. The table of the maximum errors for the same range of p-values on the side of negative correlation is as follows n err.orig err.R err.R/err.orig 12 3.83e-04 3.83e-04 1 13 2.24e-04 2.24e-04 1 14 1.43e-04 1.43e-04 1 15 9.52e-05 9.52e-05 1 16 6.85e-05 6.85e-05 1 17 5.18e-05 5.18e-05 1 18 4.07e-05 4.07e-05 1 19 3.28e-05 3.28e-05 1 20 2.66e-05 2.66e-05 1 21 2.16e-05 2.16e-05 1 22 1.77e-05 1.77e-05 1 The value n = 22 is the largest n, for which I have the exact probabilities. A patch correcting the problem follows. It is indented with 2 spaces. An intact version of the patch is in an attachment. diff --minimal -U 4 -r R-patched/src/library/stats/R/cor.test.R R-cor.test/src/library/stats/R/cor.test.R --- R-patched/src/library/stats/R/cor.test.R 2008-10-21 10:29:37.000000000 +0200 +++ R-cor.test/src/library/stats/R/cor.test.R 2009-01-26 16:58:59.123291744 +0100 @@ -151,9 +151,9 @@ pspearman <- function(q, n, lower.tail = TRUE) { if(n <= 1290 && exact) # n*(n^2 - 1) does not overflow .C("prho", as.integer(n), - as.double(round(q) + lower.tail), + as.double(round(q) + 2*lower.tail), p = double(1), integer(1), as.logical(lower.tail), PACKAGE = "stats")$p Petr.
diff --minimal -U 4 -r R-patched/src/library/stats/R/cor.test.R R-cor.test/src/library/stats/R/cor.test.R --- R-patched/src/library/stats/R/cor.test.R 2008-10-21 10:29:37.000000000 +0200 +++ R-cor.test/src/library/stats/R/cor.test.R 2009-01-26 16:58:59.123291744 +0100 @@ -151,9 +151,9 @@ pspearman <- function(q, n, lower.tail = TRUE) { if(n <= 1290 && exact) # n*(n^2 - 1) does not overflow .C("prho", as.integer(n), - as.double(round(q) + lower.tail), + as.double(round(q) + 2*lower.tail), p = double(1), integer(1), as.logical(lower.tail), PACKAGE = "stats")$p
______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel