date:20170529

Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

2017-05-29 Thread peter dalgaard

A usually trustworthy R correspondent posted a pure R implementation on SO at 
some point in his lost youth:

https://stackoverflow.com/questions/3224731/john-tukey-median-median-or-resistant-line-statistical-test-for-r-and-line

This one does indeed generate the line of identity for the (1:9, 1:9) case, so 
I do suspect that we have a genuine scr*wup in line().

Notice, incidentally, that

> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1))

Call:
line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1))

Coefficients:
[1]  -0.9407   1.1948

I.e., it is not likely an issue with exact integers or perfect fit.
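
The cycle of length 6 mentioned in the original report (quoted below) can be tabulated with a short loop. This is only a sketch: the range 4:21 is an arbitrary choice, and the coefficients printed depend on the R version in use, since line() may behave differently in later releases.

```r
# Fit line(1:i, 1:i) for a range of sample sizes and tabulate the
# coefficients next to i %% 6. For perfectly collinear data the
# expected Tukey line has intercept 0 and slope 1.
sizes <- 4:21  # arbitrary illustrative range
res <- t(sapply(sizes, function(i) coef(line(1:i, 1:i))))
tab <- data.frame(n = sizes, n_mod_6 = sizes %% 6,
                  intercept = res[, 1], slope = res[, 2])
print(tab)
```

Rows whose slope differs from 1 are the sample sizes affected by the reported behaviour.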

-pd



> On 29 May 2017, at 07:21 , GlenB  wrote:
> 
>> Tukey divides the points into three groups, not the x and y values
>> separately.
> 
>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
> 
> Ah well, I can't get it for a week. But the fact that it's often called
> Tukey's three group line (try a search on *tukey three group line* and
> you'll get plenty of hits) is pretty much a giveaway.
> 
> 
> On Mon, May 29, 2017 at 2:19 PM, GlenB  wrote:
> 
>> Tukey divides the points into three groups, not the x and y values
>> separately.
>> 
>> I'll try to get hold of the book for a direct quote, might take a couple
>> of days.
>> 
>> 
>> 
>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch 
>> wrote:
>> 
>>> On 27/05/2017 9:28 PM, GlenB wrote:
>>> 
>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is
>>>> 2 or 3
>>>> 
>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives
>>>> intercept -1 and slope 1.2
>>>> 
>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle
>>>> of length 6, with four of every six correct.
>>>> 
>>>> Bug has been present across many versions.
>>>> 
>>>> The machine I just tried it on just now has R3.2.3:
>>>> 
>>> 
>>> If you look at the source (in src/library/stats/src/line.c), the
>>> explanation is clear:  the x value is chosen as the 1/6 quantile (according
>>> to a particular definition of quantile), and the y value is chosen as the
>>> median of the y values where x is less than or equal to the 1/3 quantile.
>>> Those are different definitions (though I think they would be
>>> asymptotically equivalent under pretty weak assumptions), so it's not
>>> surprising the x value doesn't correspond perfectly to the y value, and the
>>> line ends up "wrong".
>>> 
>>> So is it a bug?  Well, that depends on Tukey's definition.  I don't have
>>> a copy of his book handy so I can't really say.  Maybe the R function is
>>> doing exactly what Tukey said it should, and that's not a bug.  Or maybe R
>>> is wrong.
>>> 
>>> Duncan Murdoch
>>> 
>>> 
>> 
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Office: A 4.23
Email: pd@cbs.dk  Priv: pda...@gmail.com


Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

2017-05-29 Thread Serguei Sokol


The problem is that the current R implementation relies on the assumption
that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6),
which turns out not to be true, however plausible it looks.

If we continue with the example of x=y=1:9,
then quantile(x, 1/6)=2.5 (here quantile() is meant in the sense of the C code, not R's quantile()),
while median(y[i] | x[i] <= quantile(x, 1/3))=2.
At the other end of the sample we get 7.5 and 8 for x and y respectively.
Hence the slope (8-2)/(7.5-2.5)=1.2.
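
The arithmetic above can be reproduced in a few lines of R. This is a sketch under stated assumptions: cquant() is a hypothetical helper that mimics the C-style quantile (the average of the order statistics at 0-based indices floor((n-1)p) and ceil((n-1)p), as in the il/iu macros of line.c), and the intercept is assumed to be the median residual, which is consistent with the reported value of -1.

```r
# Hypothetical helper mimicking the quantile used in line.c:
# average of the sorted values at 0-based indices
# floor((n-1)*p) and ceiling((n-1)*p).
cquant <- function(x, p) {
  z <- sort(x)
  n <- length(z)
  lo <- floor((n - 1) * p) + 1    # +1 because R vectors are 1-based
  hi <- ceiling((n - 1) * p) + 1
  0.5 * (z[lo] + z[hi])
}

x <- 1:9
y <- 1:9

# x summary points: 1/6 and 5/6 quantiles of x
xb <- cquant(x, 1/6)
xt <- cquant(x, 5/6)

# y summary points: medians of y within the outer thirds of x
yb <- median(y[x <= cquant(x, 1/3)])
yt <- median(y[x >= cquant(x, 2/3)])

slope <- (yt - yb) / (xt - xb)
yint <- median(y - slope * x)     # assumption: intercept = median residual

print(c(xb = xb, xt = xt, yb = yb, yt = yt, slope = slope, intercept = yint))
```

The x summary points (2.5 and 7.5) are closer together than the y summary points (2 and 8), which is what inflates the slope.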

To get a correct version of this, one should calculate the robust x points in the
same way as the y's,
i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >=
quantile(x, 2/3)).
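
A minimal R sketch of this correction follows. It is not the attached patch: cquant() and resistant_line_sketch() are hypothetical names, the C-style quantile is approximated as described in this thread, and the intercept is assumed to be the median residual.

```r
# Hypothetical helper mimicking the quantile used in line.c
# (average of sorted values at 0-based indices floor/ceiling of (n-1)*p).
cquant <- function(x, p) {
  z <- sort(x)
  n <- length(z)
  0.5 * (z[floor((n - 1) * p) + 1] + z[ceiling((n - 1) * p) + 1])
}

# Sketch of the corrected summary points: x and y medians are taken
# over the same left and right groups of points.
resistant_line_sketch <- function(x, y) {
  left  <- x <= cquant(x, 1/3)
  right <- x >= cquant(x, 2/3)
  xb <- median(x[left]);  yb <- median(y[left])
  xt <- median(x[right]); yt <- median(y[right])
  slope <- (yt - yb) / (xt - xb)
  yint <- median(y - slope * x)   # assumption: intercept = median residual
  c(intercept = yint, slope = slope)
}

print(resistant_line_sketch(1:9, 1:9))
```

For x=y=1:9 both groups give xb=yb=2 and xt=yt=8, so the sketch returns the identity line.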

Best,
Serguei.

--
Serguei Sokol
Ingenieur de recherche INRA
Metabolisme Integre et Dynamique des Systemes Metaboliques (MetaSys)

LISBP, INSA/INRA UMR 792, INSA/CNRS UMR 5504
135 Avenue de Rangueil
31077 Toulouse Cedex 04

tel: +33 5 6155 9276
fax: +33 5 6704 8825
email: so...@insa-toulouse.fr
http://metasys.insa-toulouse.fr
http://www.lisbp.fr


Re: [Rd] Interpreting R memory profiling statistics from Rprof() and gc()

2017-05-29 Thread Tomas Kalibera


On 05/18/2017 06:54 PM, Joy wrote:

> Sorry, this might be a really basic question, but I'm trying to interpret
> the results from memory profiling, and I have a few questions (marked by
> *Q#*).
>
> From the summaryRprof() documentation, it seems that the four columns of
> statistics that are reported when setting memory.profiling=TRUE are
> - vector memory in small blocks on the R heap
> - vector memory in large blocks (from malloc)
> - memory in nodes on the R heap
> - number of calls to the internal function duplicate in the time interval
> (*Q1:* Are the units of the first 3 stats in bytes?)

In Rprof.out, vector memory in small and large blocks is given in 8-byte
units (for historical reasons), but memory in nodes is given in bytes;
this is neither documented nor guaranteed. In
summaryRprof(memory="both"), memory usage is given in megabytes, as
documented.
For summaryRprof(memory="stats") and summaryRprof(memory="tseries") I
clarified this in r72743: memory usage is now in bytes, and this is documented.


> and from the gc() documentation, the two rows represent
> - ‘"Ncells"’ (_cons cells_), usually 28 bytes each on 32-bit systems and 56
> bytes on 64-bit systems,
> - ‘"Vcells"’ (_vector cells_, 8 bytes each)
> (*Q2:* how are Ncells and Vcells related to small heap/large heap/memory in
> nodes?)

Ncells describe memory in nodes (Ncells is the number of nodes).

Vcells describe memory in "small heap" + "large heap". A Vcell today
does not have much meaning; it is shown for historical reasons, but the
interesting thing is that Vcells*8 gives the number of bytes in
"small heap" + "large heap" objects.
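
As a rough illustration of these conversions, the counts returned by gc() can be turned into bytes by hand. This is a sketch only; it takes the cell sizes from the gc() documentation quoted above (8 bytes per Vcell, 56 or 28 bytes per Ncell) and uses .Machine$sizeof.pointer to guess whether the build is 64-bit.

```r
g <- gc()   # matrix with rows "Ncells" and "Vcells"

# Vector memory ("small heap" + "large heap"): 8 bytes per Vcell.
vector_bytes <- g["Vcells", "used"] * 8

# Node memory: 56 bytes per node on 64-bit builds, 28 on 32-bit builds.
node_size <- if (.Machine$sizeof.pointer == 8) 56 else 28
node_bytes <- g["Ncells", "used"] * node_size

# Compare with the "(Mb)" column that gc() reports next to "used"
# (column 2 of the matrix), which is rounded to one decimal place.
print(c(vector_Mb = vector_bytes / 1024^2, reported_Mb = g["Vcells", 2]))
print(c(node_Mb = node_bytes / 1024^2, reported_Mb = g["Ncells", 2]))
```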



> And I guess the question that led to these other questions is - *Q3:* I'd
> like to plot out the total amount of memory used over time, and I don't
> think Rprofmem() gives me what I'd like to know because, as I'm
> understanding it, Rprofmem() records the amount of memory allocated with
> each call, but this doesn't tell me the total amount of memory R is using,
> or am I mistaken?

Rprof controls a sampling profiler which regularly asks the GC how much
memory is currently in use on the R heap. Beware, though: some of that
memory may no longer be reachable but not yet collected (running gc more
frequently helps with this), and some may still be reachable but never
used again. You can get this data with summaryRprof(memory="tseries")
and plot it: add columns 1+2 or 1+2+3, depending on what you want. This
applies to r72743 or more recent; in older versions you need to multiply
columns 1 and 2 by 8. To run the GC more frequently you can use gctorture.
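
The column arithmetic described above might be wrapped as follows. This is a sketch under assumptions: heap_in_use() is a hypothetical helper, it relies only on the column positions named above (1 and 2 for vector memory, 3 for nodes), and the data frame used to exercise it is synthetic, not real profiler output.

```r
# ts: data frame as returned by summaryRprof(memory = "tseries").
# Columns 1 and 2 hold vector memory (small and large blocks),
# column 3 holds memory in nodes.
heap_in_use <- function(ts, include_nodes = TRUE, legacy_units = FALSE) {
  # Before r72743, columns 1 and 2 were in 8-byte units.
  scale <- if (legacy_units) 8 else 1
  total <- (ts[[1]] + ts[[2]]) * scale
  if (include_nodes) total <- total + ts[[3]]
  total
}

# Intended use (not run here; requires an R build with memory profiling):
#   Rprof("Rprof.out", memory.profiling = TRUE)
#   ... code to profile ...
#   Rprof(NULL)
#   ts <- summaryRprof("Rprof.out", memory = "tseries")
#   plot(heap_in_use(ts) / 1024^2, type = "l",
#        xlab = "sample", ylab = "R heap in use (MB)")

# Synthetic stand-in for the first three columns, for illustration only.
fake <- data.frame(small = c(100, 200), large = c(0, 1000),
                   nodes = c(560, 1120))
print(heap_in_use(fake))
print(heap_in_use(fake, include_nodes = FALSE, legacy_units = TRUE))
```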


Or if you are happy modifying your own R code and you don't insist on 
querying the memory size very frequently, you can also explicitly call 
gc(verbose=T) repeatedly. For this you won't need to use the profiler.


If you were looking instead at how much memory the whole R instance was 
using (that is, including memory allocated by the R gc but not presently 
used for R objects, including memory outside R heap), the easiest way 
would be to use facilities of your OS.


Rprofmem is a different thing and won't help you.

Best
Tomas



> Thanks in advance!
>
> Joy


Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

2017-05-29 Thread Serguei Sokol


Here is an attached patch.

Best,
Serguei.


--- line.c.orig	2016-03-17 00:03:03.0 +0100
+++ line.c	2017-05-29 15:06:55.727508967 +0200
@@ -25,8 +25,8 @@
 
 /* Speed up by `inlining' these (as macros) [since R version 1.2] : */
 #if 1
-#define il(n,x)	(int)floor((n - 1) * x)
-#define iu(n,x)	(int) ceil((n - 1) * x)
+#define il(n,x)	(int)floor(((n) - 1) * (x))
+#define iu(n,x)	(int) ceil(((n) - 1) * (x))
 
 #else
 static int il(int n, double x)
@@ -50,53 +50,53 @@
 double slope, yint;
 
 for(i = 0 ; i < n ; i++) {
-	z[i] = x[i];
-	w[i] = y[i];
+		z[i] = x[i];
+		w[i] = y[i];
 }
 R_rsort(z, n);/* z = ordered abscissae */
 
-tmp1 = z[il(n, 1./6.)];
-tmp2 = z[iu(n, 1./6.)];	xb = 0.5*(tmp1+tmp2);
-
 tmp1 = z[il(n, 2./6.)];
-tmp2 = z[iu(n, 2./6.)];	x1 = 0.5*(tmp1+tmp2);
+k = iu(n, 2./6.);
+tmp2 = z[k];	x1 = 0.5*(tmp1+tmp2);
 
 tmp1 = z[il(n, 4./6.)];
 tmp2 = z[iu(n, 4./6.)];	x2 = 0.5*(tmp1+tmp2);
 
-tmp1 = z[il(n, 5./6.)];
-tmp2 = z[iu(n, 5./6.)];	xt = 0.5*(tmp1+tmp2);
-
 slope = 0.;
 
 for(j = 1 ; j <= 1 ; j++) {
-	/* yb := Median(y[i]; x[i] <= quantile(x, 1/3) */
-	k = 0;
-	for(i = 0 ; i < n ; i++)
-	if(x[i] <= x1)
-		z[k+

Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

2017-05-29 Thread Serguei Sokol


Sorry, I noticed too late that the original file and my editor use different
tab widths.
Here is the patch using only spaces instead of a mix of tabs and spaces.

Serguei.


--- line.c.orig	2016-03-17 00:03:03.0 +0100
+++ line.c	2017-05-29 15:24:16.871861092 +0200
@@ -17,7 +17,7 @@
  *  https://www.R-project.org/Licenses/
  */
 
-#include 	/* R_rsort() */
+#include /* R_rsort() */
 #include 
 
 #include 
@@ -25,8 +25,8 @@
 
 /* Speed up by `inlining' these (as macros) [since R version 1.2] : */
 #if 1
-#define il(n,x)	(int)floor((n - 1) * x)
-#define iu(n,x)	(int) ceil((n - 1) * x)
+#define il(n,x) (int)floor(((n) - 1) * (x))
+#define iu(n,x) (int) ceil(((n) - 1) * (x))
 
 #else
 static int il(int n, double x)
@@ -41,69 +41,69 @@
 #endif
 
 static void line(double *x, double *y, /* input (x[i],y[i])s */
-		 double *z, double *w, /* work and output: resid. & fitted */
-		 /* all the above of length */ int n,
-		 double coef[2])
+ doubl

Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3

2017-05-29 Thread GlenB

Incidentally (though I don't expect anyone will want to pursue it)
Johnstone & Velleman give standard errors for the estimates in Johnstone,
Iain M., and Paul F. Velleman. “The Resistant Line and Related Regression
Methods.” Journal of the American Statistical Association, vol. 80, no.
392, 1985, pp. 1041–1054.

