Dear Martin, > -----Original Message----- > From: Martin Maechler [mailto:[EMAIL PROTECTED] > Sent: Wednesday, September 14, 2005 3:56 AM > To: John Fox > Cc: 'Martin Maechler'; 'Werner Stahel'; 'John Maindonald'; > R-devel@stat.math.ethz.ch > Subject: Re: [Rd] plot(<lm>): new behavior in R-2.2.0 alpha > > Thank you, John, for > Dear > >>>>> "JohnF" == John Fox <[EMAIL PROTECTED]> > >>>>> on Tue, 13 Sep 2005 16:41:28 -0400 writes: > > JohnF> A couple of comments on the new plots (numbers 5 and 6): > JohnF> Perhaps some more thought could be given to the > JohnF> plotted contours for Cook's D (which are 0.5 and 1.0 > JohnF> in the example -- large Cook's Ds). A rule-of-thumb > JohnF> cut-off for this example is 4/(n - p) = 4/(50 - 5) = > JohnF> 0.089, and the discrepancy will grow with n. > > That's an interesting suggestion. Where does the 4/(n-p) > come from? or put differently, should I better read in of > your books? ;-)
I believe that I got this by transforming a cutoff suggested by Chatterjee and Hadi for dffits to the Cook's D scale. > > Honestly, I'm so much a fan of R_i / h_ii that I didn't even > know that. > > JohnF> I'm not terribly fond of number 6, since it seems > JohnF> natural to me to think of the relationship among > JohnF> these quantities as influence on coefficients = > JohnF> leverage * outlyingness (which corresponds to 5); > JohnF> also note how in the example, the labels for large > JohnF> residuals overplot. > > I think John mainly proposed '6' because other proposed it as > another good alternative. From the few examples I've looked > at, I haven't got fond at all either. > > JohnF> Finally, your remarks about balanced data are cogent > JohnF> and suggest going with 1:3 in this case (since R_i > JohnF> vs. i is pretty redundant with the QQ plot). > > Ah, that's another, maybe better alternative to my proposal. > > One drawback of it is for situations where people do something > like par(mfrow=c(2,2)) > before calling plot(<lm>) for several fitted lm models, > assuming to fill one page for each of the plots. Good point -- I do that myself in the Rcmdr package. > and I think that's something I would have done always in such > situations where several different models are fitted and compared. > > Maybe plot.lm() should "advance an empty frame" as soon as > prod(par("mfrow")) >= 4 > in that case? > That seems a good idea, at least for the default behaviour. An unlikely complication would occur if the user wanted to put the plots generated by plot.lm() on a page along with other plots. Regards, John > Martin > > JohnF> -------------------------------- > JohnF> John Fox > JohnF> Department of Sociology > JohnF> McMaster University > JohnF> Hamilton, Ontario > JohnF> Canada L8S 4M4 > JohnF> 905-525-9140x23604 > JohnF> http://socserv.mcmaster.ca/jfox > JohnF> -------------------------------- > > >> -----Original Message----- > >> From: Martin Maechler [mailto:[EMAIL PROTECTED] > >> Sent: Tuesday, September 13, 2005 9:18 AM > >> To: R-devel@stat.math.ethz.ch > >> Cc: John Maindonald; Werner Stahel; John Fox > >> Subject: plot(<lm>): new behavior in R-2.2.0 alpha > >> > >> As some of you R-devel readers may know, the plot() method > >> for "lm" objects is based in large parts on contributions by > >> John Maindonald, subsequently "massaged" by me and other > >> R-core members. > >> > >> In the statistics litterature on applied regression, people > >> have had diverse oppinions on what (and how many!) plots > >> should be used for goodness-of-fit / residual diagnostics, > >> and to my knowledge most people have agreed to want to see > >> one (or more) version of a Tukey-Anscombe plot {Residuals ~ > >> Fitted} and a QQ normal plot. > >> Another consideration was to be somewhat close to what S > >> (S-plus) was doing. So we have two versions of residuals vs > >> fitted, one for checking E[error] = 0, the other for > >> checking Var[error] = constant. So we got to the > first three plots of > >> plot.lm() about which I don't want to debate at the moment > >> {though, there's room for improvement even there: e.g., I > >> know of at least one case where plot(<lm>) wasn't used > >> because the user was missing the qqline() she was so used to > >> in the QQ plot} > >> > >> The topic of this e-mail is the (default) 4th plot which I > >> had changed; really prompted by the following: > >> More than three months ago, John wrote > >> http://tolstoy.newcastle.edu.au/R/devel/05/04/0594.html > >> (which became a thread of about 20 messages, from Apr.23 > >> -- 29, 2005) > >> > >> and currently, > >> NEWS for R 2.2.0 alpha contains > >> > >> >> USER-VISIBLE CHANGES > >> >> > >> >> o plot(<lm object>) uses a new default for the > fourth panel when > >> >> 'which' is not specified. > >> >> ___ may change before release ___ > >> > >> and the header is > >> > >> plot.lm <- > >> function (x, which = c(1:3, 5), > >> caption = c("Residuals vs Fitted", > >> "Normal Q-Q", "Scale-Location", > >> "Cook's distance", "Residuals vs Leverage", > >> "Cook's distance vs Leverage"), > >> ......... ) {..............} > >> > >> So we now have 6 possible plots, where 1,2,3 and 5 are the > >> defaults (and 1,2,3,4 where the old defaults). > >> > >> For the influential points and combination of 'influential' > >> and 'outlier' > >> there have been quite a few more proposals in the past. R <= > >> 2.1.x has been plotting the Cook's distances vs. observation > >> number, whereas quite a few people in the past have noted > >> that all influence measures being more or less complicated > >> functions of residuals and "hat values" aka "leverages", > >> (R_i, h_{ii}), it would really make sense and fit more to the > >> other plots to plot residuals vs. Leverages --- with the > >> additional idea of adding *contours* of (equal) Cook's > >> distances to that plot, in case one would really want > to seem them. > >> > >> In the mean time, this has been *active* in R-devel for quite > >> a while, and we haven't received any new comments. > >> > >> One remaining problem I'd like to address is the "balanced AOV" > >> situation, something probably pretty rare nowadays in real > >> practice, but common of course in teaching ANOVA. > >> As you may remember, in a balanced design, all observations > >> have the same leverages h_{ii}, and the plot R_i vs h_ii > >> is really not so useful. In that case, the cook distances > >> CD_i = c * R_i ^2 and so CD_i vs i {the old "4-th plot in > >> plot.lm"} is > >> graphically identical to R_i^2 vs i. > >> Now in that case (of identical h_ii's), I think one would > >> really want "R_i vs i". > >> > >> Question to the interested parties: > >> > >> Should there be an automatism > >> ``when h_ii == const'' {"==" with a bit of numerical fuzz} > >> plot a) R_i vs i > >> or b) CD_i vs i > >> > >> or should users have to manually use > >> plot(<lm>, which=1:4, ...) > >> in such a case? > >> > >> Feedback very welcome, > >> particularly, you first look at the examples in help(plot.lm) > >> in *R-devel* aka R-2.2.0 alpha. > >> > >> Martin Maechler, ETH Zurich ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel