Here is an attached patch.

Best,
Serguei.

Le 29/05/2017 à 12:21, Serguei Sokol a écrit :
The problem or actual R implementation relies on an assumption
that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6)
which reveals not to be true despite very trustful appearance.

If we continue with the example of x=y=1:9
then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not R's one)
while median(y[i] | x[i] <= quantile(x, 1/3))=2
On the other sample's side we've got 7.5 and 8 for x and y respectively.
Hence the slope (8-2)/(7.5-2.5)=1.2

To get a correct version of this, one should calculate x robust points in the 
same way as the y's,
i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >= 
quantile(x, 2/3))

Best,
Serguei.

Le 29/05/2017 à 10:02, peter dalgaard a écrit :
A usually trustworthy R correspondent posted a pure R implementation on SO at 
some point in his lost youth:

https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line

This one does indeed generate the line of identity for the (1:9, 1:9) case, so 
I do suspect that we have a genuine scr*wup in line().

Notice, incidentally, that

line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))
Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))

Coefficients:
[1]  -0.9407   1.1948

I.e., it is not likely an issue with exact integers or perfect fit.

-pd



On 29 May 2017, at 07:21 , GlenB <glnbr...@gmail.com> wrote:

Tukey divides the points into three groups, not the x and y values
separately.

I'll try to get hold of the book for a direct quote, might take a couple
of days.

Ah well, I can't get it for a week. But the fact that it's often called
Tukey's three group line (try a search on *tukey three group line* and
you'll get plenty of hits) is pretty much a giveaway.


On Mon, May 29, 2017 at 2:19 PM, GlenB <glnbr...@gmail.com> wrote:

Tukey divides the points into three groups, not the x and y values
separately.

I'll try to get hold of the book for a direct quote, might take a couple
of days.



On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch <murdoch.dun...@gmail.com>
wrote:

On 27/05/2017 9:28 PM, GlenB wrote:

Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2
or
3

Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
intercept -1 and slope 1.2

Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
of
length 6, with four of every six correct.

Bug has been present across many versions.

The machine I just tried it on just now has R3.2.3:

If you look at the source (in src/library/stats/src/line.c), the
explanation is clear:  the x value is chosen as the 1/6 quantile (according
to a particular definition of quantile), and the y value is chosen as the
median of the y values where x is less than or equal to the 1/3 quantile.
Those are different definitions (though I think they would be
asymptotically equivalent under pretty weak assumptions), so it's not
surprising the x value doesn't correspond perfectly to the y value, and the
line ends up "wrong".

So is it a bug?  Well, that depends on Tukey's definition.  I don't have
a copy of his book handy so I can't really say.  Maybe the R function is
doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
is wrong.

Duncan Murdoch


    [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: so...@insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr

--- line.c.orig	2016-03-17 00:03:03.000000000 +0100
+++ line.c	2017-05-29 15:06:55.727508967 +0200
@@ -25,8 +25,8 @@
 
 /* Speed up by `inlining' these (as macros) [since R version 1.2] : */
 #if 1
-#define il(n,x)	(int)floor((n - 1) * x)
-#define iu(n,x)	(int) ceil((n - 1) * x)
+#define il(n,x)	(int)floor(((n) - 1) * (x))
+#define iu(n,x)	(int) ceil(((n) - 1) * (x))
 
 #else
 static int il(int n, double x)
@@ -50,53 +50,53 @@
     double slope, yint;
 
     for(i = 0 ; i < n ; i++) {
-	z[i] = x[i];
-	w[i] = y[i];
+		z[i] = x[i];
+		w[i] = y[i];
     }
     R_rsort(z, n);/* z = ordered abscissae */
 
-    tmp1 = z[il(n, 1./6.)];
-    tmp2 = z[iu(n, 1./6.)];	xb = 0.5*(tmp1+tmp2);
-
     tmp1 = z[il(n, 2./6.)];
-    tmp2 = z[iu(n, 2./6.)];	x1 = 0.5*(tmp1+tmp2);
+    k = iu(n, 2./6.);
+    tmp2 = z[k];	x1 = 0.5*(tmp1+tmp2);
 
     tmp1 = z[il(n, 4./6.)];
     tmp2 = z[iu(n, 4./6.)];	x2 = 0.5*(tmp1+tmp2);
 
-    tmp1 = z[il(n, 5./6.)];
-    tmp2 = z[iu(n, 5./6.)];	xt = 0.5*(tmp1+tmp2);
-
     slope = 0.;
 
     for(j = 1 ; j <= 1 ; j++) {
-	/* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
-	k = 0;
-	for(i = 0 ; i < n ; i++)
-	    if(x[i] <= x1)
-		z[k++] = w[i];
-	R_rsort(z, k);
-	yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
-	/* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */
-	k = 0;
-	for(i = 0 ; i < n ; i++)
-	    if(x[i] >= x2)
-		z[k++] = w[i];
-	R_rsort(z,k);
-	yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
-
-	slope += (yt - yb)/(xt - xb);
-	for(i = 0 ; i < n ; i++) {
-	    z[i] = y[i] - slope*x[i];
-	    /* never used: w[i] = z[i]; */
-	}
-	R_rsort(z,n);
-	yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
+		/* xb := Median(x[i]; x[i] <= quantile(x, 1/3) */
+		k += z[k] == x1;
+		xb = 0.5*(z[il(k, 0.5)] + z[iu(k, 0.5)]);
+		/* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
+		k = 0;
+		for(i = 0 ; i < n ; i++)
+			if(x[i] <= x1)
+				z[k++] = w[i];
+		R_rsort(z, k);
+		yb = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+
+		/* yt := Median(y[i]; x[i] >= quantile(x, 2/3) */
+		k = 0;
+		for(i = 0 ; i < n ; i++)
+			if(x[i] >= x2)
+				z[k++] = w[i];
+		R_rsort(z,k);
+		yt = 0.5 * (z[il(k, 0.5)] + z[iu(k, 0.5)]);
+		/* xt := Median(x[i]; x[i] >= quantile(x, 2/3) */
+		xt = 0.5 * (z[n - k + il(k, 0.5)] + z[n - k + iu(k, 0.5)]);
+
+		slope += (yt - yb)/(xt - xb);
+		for(i = 0 ; i < n ; i++) {
+			z[i] = y[i] - slope*x[i];
+			/* never used: w[i] = z[i]; */
+		}
+		R_rsort(z,n);
+		yint = 0.5 * (z[il(n, 0.5)] + z[iu(n, 0.5)]);
     }
     for( i = 0 ; i < n ; i++ ) {
-	w[i] = yint + slope*x[i];
-	z[i] = y[i] - w[i];
+		w[i] = yint + slope*x[i];
+		z[i] = y[i] - w[i];
     }
     coef[0] = yint;
     coef[1] = slope;
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to