[Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3 Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives intercept -1 and slope 1.2 Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of length 6, with four of every six correct. Bug has been present across many versions. The machine I just tried it on just now has R3.2.3: _ platform x86_64-w64-mingw32 arch x86_64 os mingw32 system x86_64, mingw32 status major 3 minor 2.3 year 2015 month 12 day10 svn rev69752 language R version.string R version 3.2.3 (2015-12-10) nickname Wooden Christmas-Tree [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Tukey divides the points into three groups, not the x and y values separately. I'll try to get hold of the book for a direct quote, might take a couple of days. On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch wrote: > On 27/05/2017 9:28 PM, GlenB wrote: > >> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >> or >> 3 >> >> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >> intercept -1 and slope 1.2 >> >> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle of >> length 6, with four of every six correct. >> >> Bug has been present across many versions. >> >> The machine I just tried it on just now has R3.2.3: >> > > If you look at the source (in src/library/stats/src/line.c), the > explanation is clear: the x value is chosen as the 1/6 quantile (according > to a particular definition of quantile), and the y value is chosen as the > median of the y values where x is less than or equal to the 1/3 quantile. > Those are different definitions (though I think they would be > asymptotically equivalent under pretty weak assumptions), so it's not > surprising the x value doesn't correspond perfectly to the y value, and the > line ends up "wrong". > > So is it a bug? Well, that depends on Tukey's definition. I don't have a > copy of his book handy so I can't really say. Maybe the R function is > doing exactly what Tukey said it should, and that's not a bug. Or maybe R > is wrong. > > Duncan Murdoch > > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
> Tukey divides the points into three groups, not the x and y values separately. > I'll try to get hold of the book for a direct quote, might take a couple of days. Ah well, I can't get it for a week. But the fact that it's often called Tukey's three group line (try a search on *tukey three group line* and you'll get plenty of hits) is pretty much a giveaway. On Mon, May 29, 2017 at 2:19 PM, GlenB wrote: > Tukey divides the points into three groups, not the x and y values > separately. > > I'll try to get hold of the book for a direct quote, might take a couple > of days. > > > > On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch > wrote: > >> On 27/05/2017 9:28 PM, GlenB wrote: >> >>> Bug: stats::line() does not produce correct Tukey line when n mod 6 is 2 >>> or >>> 3 >>> >>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it gives >>> intercept -1 and slope 1.2 >>> >>> Trying line(1:i,1:i) across a range of i makes it clear there's a cycle >>> of >>> length 6, with four of every six correct. >>> >>> Bug has been present across many versions. >>> >>> The machine I just tried it on just now has R3.2.3: >>> >> >> If you look at the source (in src/library/stats/src/line.c), the >> explanation is clear: the x value is chosen as the 1/6 quantile (according >> to a particular definition of quantile), and the y value is chosen as the >> median of the y values where x is less than or equal to the 1/3 quantile. >> Those are different definitions (though I think they would be >> asymptotically equivalent under pretty weak assumptions), so it's not >> surprising the x value doesn't correspond perfectly to the y value, and the >> line ends up "wrong". >> >> So is it a bug? Well, that depends on Tukey's definition. I don't have >> a copy of his book handy so I can't really say. Maybe the R function is >> doing exactly what Tukey said it should, and that's not a bug. Or maybe R >> is wrong. >> >> Duncan Murdoch >> >> > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Incidentally (though I don't expect anyone will want to pursue it) Johnstone & Velleman give standard errors for the estimates in Johnstone, Iain M., and Paul F. Velleman. “The Resistant Line and Related Regression Methods.” Journal of the American Statistical Association, vol. 80, no. 392, 1985, pp. 1041–1054. On Mon, May 29, 2017 at 11:13 PM, Serguei Sokol wrote: > Here is an attached patch. > > Best, > Serguei. > > > Le 29/05/2017 à 12:21, Serguei Sokol a écrit : > >> The problem or actual R implementation relies on an assumption >> that median(x[i] | x[i] <= quantile(x, 1/3)) == quantile(x, 1/6) >> which reveals not to be true despite very trustful appearance. >> >> If we continue with the example of x=y=1:9 >> then quantile(x, 1/6)=2.5 (here quantile() is taken in C-code sens, not >> R's one) >> while median(y[i] | x[i] <= quantile(x, 1/3))=2 >> On the other sample's side we've got 7.5 and 8 for x and y respectively. >> Hence the slope (8-2)/(7.5-2.5)=1.2 >> >> To get a correct version of this, one should calculate x robust points in >> the same way as the y's, >> i.e. xb=median(x[i] | x[i] <= quantile(x, 1/3)) and xt=median(x[i] | x[i] >> >= quantile(x, 2/3)) >> >> Best, >> Serguei. >> >> Le 29/05/2017 à 10:02, peter dalgaard a écrit : >> >>> A usually trustworthy R correspondent posted a pure R implementation on >>> SO at some point in his lost youth: >>> >>> https://stackoverflow.com/questions/3224731/john-tukey-media >>> n-median-or-resistant-line-statistical-test-for-r-and-line >>> >>> This one does indeed generate the line of identity for the (1:9, 1:9) >>> case, so I do suspect that we have a genuine scr*wup in line(). >>> >>> Notice, incidentally, that >>> >>> line(1:9+rnorm(9,,1e-1),1:9+rnorm(9,,1e-1)) >>>> >>> Call: >>> line(1:9 + rnorm(9, , 0.1), 1:9 + rnorm(9, , 0.1)) >>> >>> Coefficients: >>> [1] -0.9407 1.1948 >>> >>> I.e., it is not likely an issue with exact integers or perfect fit. >>> >>> -pd >>> >>> >>> >>> On 29 May 2017, at 07:21 , GlenB wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> >>>> separately. >>>> >>>> I'll try to get hold of the book for a direct quote, might take a couple >>>>> >>>> of days. >>>> >>>> Ah well, I can't get it for a week. But the fact that it's often called >>>> Tukey's three group line (try a search on *tukey three group line* and >>>> you'll get plenty of hits) is pretty much a giveaway. >>>> >>>> >>>> On Mon, May 29, 2017 at 2:19 PM, GlenB wrote: >>>> >>>> Tukey divides the points into three groups, not the x and y values >>>>> separately. >>>>> >>>>> I'll try to get hold of the book for a direct quote, might take a >>>>> couple >>>>> of days. >>>>> >>>>> >>>>> >>>>> On Mon, May 29, 2017 at 8:40 AM, Duncan Murdoch < >>>>> murdoch.dun...@gmail.com> >>>>> wrote: >>>>> >>>>> On 27/05/2017 9:28 PM, GlenB wrote: >>>>>> >>>>>> Bug: stats::line() does not produce correct Tukey line when n mod 6 >>>>>>> is 2 >>>>>>> or >>>>>>> 3 >>>>>>> >>>>>>> Example: line(1:9,1:9) should have intercept 0 and slope 1 but it >>>>>>> gives >>>>>>> intercept -1 and slope 1.2 >>>>>>> >>>>>>> Trying line(1:i,1:i) across a range of i makes it clear there's a >>>>>>> cycle >>>>>>> of >>>>>>> length 6, with four of every six correct. >>>>>>> >>>>>>> Bug has been present across many versions. >>>>>>> >>>>>>> The machine I just tried it on just now has R3.2.3: >>>>>>> >>>>>>> If you look at the source (in src/library/stats/src/line.c), the >>>>>> explanation is clear: the x value is chosen as the 1/6 quantile >>>>>> (according >>>>>> to a particular definition of quantile), and the y value is chosen as >>&g
Re: [Rd] stats::line() does not produce correct Tukey line when n mod 6 is 2 or 3
Martin Maechler says in reply to Sergueï Sokol > Note the 'Subject' you've chosen for this thread, "... does not produce the correct Tukey line", The choice of title was mine not Serguei's; I posted the original message where the error was pointed out I agree with Martin's assessment that the correct split (both by Tukey's lights and by general practice) for 19 points would be 6,7,6 and I also agree that it's better to "fix more" in this instance, where possible. (e.g. Johnstone&Velleman's standard errors would be a nice thing to add if feasible) -- but if any blame is attached to the choice of title, it really should be aimed at me. Glen On Wed, May 31, 2017 at 2:51 AM, Martin Maechler wrote: > > Serguei Sokol > > on Tue, 30 May 2017 16:01:17 +0200 writes: > > > Le 30/05/2017 à 09:33, Martin Maechler a écrit : ... > >> However, even after the patch, The example from the SO > >> post differs from the result of Richie Cotton's > >> function... > > The explanation is quite simple. In SO function, the first > > 1/3 quantile of used example counts 6 points (of 19 in > > total), while line()'s definition of quantile leads to 8 > > points. The same numbers (6 and 8) are on the other end of > > sample. > > so the number of obs. for the three thirds for line() are >{8, 3, 8} in line() [also, after your patch, right?] > > whereas in MMline() they are as they should be, namely > >{6, 7, 6} > > But the {8, 3, 8} split is not at all what all "the literature", > including Tukey himself says that "should" be done. > (Other literature on the topic suggests that the optimal sizes > of the split in three groups depends on the distribution of x ..) > > OTOH, MMline() does exactly what "the literature" and also the > reference on the ?line help pages says. > > > In x sample, there are few repeated values, this > > is certainly be the reason of different quantiles.. > > > I am not sure that one quantile definition is better or > > more correct than the other. > > > So I would leave line()'s definition as is. > > you mean _after_ applying your patch, I assume. > > I currently tend do disagree. If we change line() we should > rather fix more .. > Note the 'Subject' you've chosen for this thread, > "... does not produce the correct Tukey line", > so I think we should get better. > > Apart from Richie / my MMline() function, I've also noticed > that ACSWR :: resistant_line() > exists. > > However "the literature" (see references below), notably the two > with Hoaglin, strongly recommends smarter iterations, and > -- lo and behold! -- when this topic came up last (for me) in > Dec. 2014, I did spend about 2 days work (or more?) to get the > FORTRAN code from the 1981 - book (which is abbreviated the > "ABC of EDA") from a somewhat useful OCR scan into compilable > Fortran code and then f2c'ed, wrote an R interface function > found problems i.e., bugs, including infinite loops, fixed most > AFAICS, but somehow did not finish making the result available. > > Yes, and I have too many other things on my desk... this will > have to wait! > > References: > > Tukey, J. W. (1977). _Exploratory Data Analysis_, Reading > Massachusetts: Addison-Wesley. > > Velleman, P. F. and Hoaglin, D. C. (1981) _Applications, Basics > and Computing of Exploratory Data Analysis_ Duxbury Press. > > Emerson, J. D. and Hoaglin, D. C. (1983) Resistant Lines for y > versus x. Chapter 5 of _Understanding Robust and Exploratory Data > Analysis_, eds. David C. Hoaglin, Frederick Mosteller and John W. > Tukey. Wiley. > > Iain M. Johnstone and Paul F. Velleman (1985) The Resistant Line > and Related Regression Methods. _Journal of the American > Statistical Association_ *80*, 1041-1054.https://dx.doi.org/10.1080/01621459.1985.10478222> > > > > Best, Sergueï. > > Martin Maechler, ETH Zurich (and R core team) > > __ > R-devel@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-devel > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel