Sorry, I have seen it too late that we had different tab width in the original
file and my editor.
Here is the patch with all white spaces instead of mixing tabs and white spaces.
Serguei.
Le 29/05/2017 à 15:13, Serguei Sokol a écrit :
Here is an attached patch.
Best,
Serguei.
Le 29/05/2017 à 12:21, Serguei Sokol a écrit :
The problem or actual R implementation relies on an assumption
that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
which reveals not to be true despite very trustful appearance.
If we continue with the example of x=y=1:9
then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
while median(y[i] | x[i] <= quantile(x, 1/3))=2
On the other sample's side we've got 7.5 and 8 for x and y respectively.
Hence the slope (8-2)/(7.5-2.5)=1.2
To get a correct version of this, one should calculate x robust points in the
same way as the y's,
i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >=
quantile(x, 2/3))
Best,
Serguei.
Le 29/05/2017 à 10:02, peter dalgaard a écrit :
A usually trustworthy R correspondent posted a pure R implementation on SO at
some point in his lost youth:
https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line
This one does indeed generate the line of identity for the (1:9, 1:9) case, so
I do suspect that we have a genuine scr*wup in line().
Notice, incidentally, that
line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))
Coefficients:
[1] -0.9407 1.1948
I.e., it is not likely an issue with exact integers or perfect fit.
-pd
On 29 May 2017, at 07:21 , GlenB <glnbr...@gmail.com> wrote:
Tukey divides the points into three groups, not the x and y values
separately.
I'll try to get hold of the book for a direct quote, might take a couple
of days.
Ah well, I can't get it for a week. But the fact that it's often called
Tukey's three group line (try a search on *tukey three group line* and
you'll get plenty of hits) is pretty much a giveaway.
On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbr...@gmail.com> wrote:
Tukey divides the points into three groups, not the x and y values
separately.
I'll try to get hold of the book for a direct quote, might take a couple
of days.
On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.dun...@gmail.com>
wrote:
On 27/05/2017 9:28 PM, GlenB wrote:
Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
or
3
Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
intercept -1 and slope 1.2
Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
of
length 6, with four of every six correct.
Bug has been present across many versions.
The machine I just tried it on just now has R3.2.3:
If you look at the source (in src/library/stats/src/line.c), the
explanation is clear: the x value is chosen as the 1/6 quantile (according
to a particular definition of quantile), and the y value is chosen as the
median of the y values where x is less than or equal to the 1/3 quantile.
Those are different definitions (though I think they would be
asymptotically equivalent under pretty weak assumptions), so it's not
surprising the x value doesn't correspond perfectly to the y value, and the
line ends up "wrong".
So is it a bug? Well, that depends on Tukey's definition. I don't have
a copy of his book handy so I can't really say. Maybe the R function is
doing exactly what Tukey said it should, and that's not a bug. Or maybe R
is wrong.
Duncan Murdoch
[[alternative HTML version deleted]]
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)
LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04
tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: so...@insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr
--- line.c.orig 2016-03-17 00:03:03.000000000 +0100
+++ line.c 2017-05-29 15:24:16.871861092 +0200
@@ -17,7 +17,7 @@
* https://www.R-project.org/Licenses/
*/
-#include <R_ext/Utils.h> /* R_rsort() */
+#include <R_ext/Utils.h> /* R_rsort() */
#include <math.h>
#include <Rinternals.h>
@@ -25,8 +25,8 @@
/* Speed up by `inlining' these (as macros) [since R version 1.2] : */
#if 1
-#define il(n,x) (int)floor((n - 1) * x)
-#define iu(n,x) (int) ceil((n - 1) * x)
+#define il(n,x) (int)floor(((n) - 1) * (x))
+#define iu(n,x) (int) ceil(((n) - 1) * (x))
#else
static int il(int n, double x)
@@ -41,69 +41,69 @@
#endif
static void line(double *x, double *y, /* input (x[i],y[i])s */
- double *z, double *w, /* work and output: resid. & fitted */
- /* all the above of length */ int n,
- double coef[2])
+ double *z, double *w, /* work and output: resid. & fitted */
+ /* all the above of length */ int n,
+ double coef[2])
{
int i, j, k;
double xb, x1, x2, xt, yt, yb, tmp1, tmp2;
double slope, yint;
for(i = 0 ; i < n ; i++) {
- z[i] = x[i];
- w[i] = y[i];
+ z[i] = x[i];
+ w[i] = y[i];
}
R_rsort(z, n);/* z = ordered abscissae */
- tmp1 = z[il(n, 1./6.)];
- tmp2 = z[iu(n, 1./6.)]; xb = 0.5*(tmp1+tmp2);
-
tmp1 = z[il(n, 2./6.)];
- tmp2 = z[iu(n, 2./6.)]; x1 = 0.5*(tmp1+tmp2);
+ k = iu(n, 2./6.);
+ tmp2 = z[k]; x1 = 0.5*(tmp1+tmp2);
tmp1 = z[il(n, 4./6.)];
- tmp2 = z[iu(n, 4./6.)]; x2 = 0.5*(tmp1+tmp2);
-
- tmp1 = z[il(n, 5./6.)];
- tmp2 = z[iu(n, 5./6.)]; xt = 0.5*(tmp1+tmp2);
+ tmp2 = z[iu(n, 4./6.)]; x2 = 0.5*(tmp1+tmp2);
slope = 0.;
for(j = 1 ; j <= 1 ; j++) {
- /* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
- k = 0;
- for(i = 0 ; i < n ; i++)
- if(x[i] <= x1)
- z[k++] = w[i];
- R_rsort(z, k);
- yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
- /* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */
- k = 0;
- for(i = 0 ; i < n ; i++)
- if(x[i] >= x2)
- z[k++] = w[i];
- R_rsort(z,k);
- yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
- slope += (yt - yb)/(xt - xb);
- for(i = 0 ; i < n ; i++) {
- z[i] = y[i] - slope*x[i];
- /* never used: w[i] = z[i]; */
- }
- R_rsort(z,n);
- yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
+ /* xb := Median(x[i]; x[i] <= quantile(x, 1/3) */
+ k += z[k] == x1;
+ xb = 0.5*(z[il(k, 0.5)] + z[iu(k, 0.5)]);
+ /* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
+ k = 0;
+ for(i = 0 ; i < n ; i++)
+ if(x[i] <= x1)
+ z[k++] = w[i];
+ R_rsort(z, k);
+ yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+
+ /* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */
+ k = 0;
+ for(i = 0 ; i < n ; i++)
+ if(x[i] >= x2)
+ z[k++] = w[i];
+ R_rsort(z,k);
+ yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+ /* xt := Median(x[i]; x[i] >= quantile(x, 2/3) */
+ xt = 0.5 * (z[n - k + il(k, 0.5)] + z[n - k + iu(k, 0.5)]);
+
+ slope += (yt - yb)/(xt - xb);
+ for(i = 0 ; i < n ; i++) {
+ z[i] = y[i] - slope*x[i];
+ /* never used: w[i] = z[i]; */
+ }
+ R_rsort(z,n);
+ yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
}
for( i = 0 ; i < n ; i++ ) {
- w[i] = yint + slope*x[i];
- z[i] = y[i] - w[i];
+ w[i] = yint + slope*x[i];
+ z[i] = y[i] - w[i];
}
coef[0] = yint;
coef[1] = slope;
}
void tukeyline0(double *x, double *y, double *z, double *w, int *n,
- double *coef)
+ double *coef)
{
line(x, y, z, w, *n, coef);
}
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel