[Rd] Errors in wilcox family functions

Aidan Lakshman Mon, 12 Feb 2024 13:15:59 -0800

Hi Everyone,

Following the previous discussion on optimizing *wilcox functions, Andreas 
Loeffler brought to my attention a few other bugs in `wilcox` family functions. 
It seems like these issues have been discussed online in the past few months, 
but I haven’t seen discussion on R-devel...unless I missed an email, it seems 
like discussion never made it to the mailing list. I haven’t seen any bug 
reports on Bugzilla to this effect either, so I’m emailing the list to start 
discussion with both Andreas and Andrey cc’d so they can provide additional 
detail.

Most of these issues have been raised by Andrey Akinshin on his blog, which I
will link throughout for reproducible examples and code rather than making this
email even longer than it already is. Of the following points, (1-2) could be
considered bugs, and (3-4) enhancements. I wanted to reach out to determine if
R-devel was already aware of these, and if so, if people have already been
working on them.

The current issues are the following:

1. `wilcox.test` returns very distorted p-values when `exact=FALSE`, especially
at the tails of the distribution. These errors are due to inaccuracy in the
normal approximation. While there will obviously be errors using any
approximation, it is possible to use an Edgeworth Expansion correction to
uniformly decreases the error in p-value approximation without substantially
rewriting the internals. An example patch and benchmarks are available at
https://github.com/ahl27/R_Patches/tree/94e8e0bcf5076841637f1031ea9cf456ad18d7ef/EdgeworthExpansion.

More verbose details, examples, and benchmarks:
- https://aakinshin.net/posts/r-mann-whitney-incorrect-p-value/
- https://aakinshin.net/posts/mw-edgeworth/

2. The built-in Hodges-Lehmann estimator has some edge cases that result in
incorrect answers without suitable warnings. These are detailed extensively at
the links below. In short, the following cases result in an incorrect value for
`wilcox.test(..., conf.int=TRUE)$estimate`:
- `x` has zero values: these are trimmed before the median is calculated, so
`x=c(0,1,2)` returns a median of 1.5.
- tied values in either one- or two-sample tests: these can force R to use a
normal approximation even in low-sample cases.
- degenerate two-sample tests (`max(x)==min(x)` and `max(y)==min(y)`): this
produces an error due to an unhandled edge case.

This particular issue has caused problems for the DescTools package, which
recently reimplemented HodgesLehmann due to inaccurate results in base R. At
the very least, warnings should probably be added so that users don’t
misinterpret these results. Better still would be to just fix these cases, but
I haven’t dug super deep into the codebase yet, so I’m not completely sure how
difficult that would be.

Details and examples:
- https://aakinshin.net/posts/r-hodges-lehmann-problems/
- Discussion in DescTools
(https://github.com/AndriSignorell/DescTools/issues/97)

3. `*signrank` functions hang for large values of `n`. The exact value varies,
but tends to be around `n>=1000`. `wilcox.test` supports an `exact`
argument—should an inexact approximation be implemented for *signrank functions
with `n>=1000`? An Edgeworth approximation could be similarly leveraged here to
return results in a reasonable manner.

Full writeup: https://aakinshin.net/posts/signrank-limitations/

4. Suggestions for updating tie correction in Mann-Whitney U tests. Andrey has
a very extensive writeup of the cases that make tie correction sometimes
unintuitive, and I’m just going to link it here:
https://aakinshin.net/posts/mw-confusing-tie-correction/.

If these seem like they’d be welcome improvements to R, I’ll work with Andrey
on putting some patches up on Bugzilla this week. If people have already known
about these and discussed them and I just somehow missed it, then I’m very
sorry for the verbose email.

Thanks again to Andrey for the writeup, and Andreas for pointing me to it.

-Aidan

-----------------------
Aidan Lakshman (he/him)
www.AHL27.com

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

[Rd] Errors in wilcox family functions

Reply via email to