The discussion of Leo Breiman's paper in Statistical Science, "Statistical Modeling: The Two Cultures", is a must-read for all statisticians doing prediction modeling. See especially the exchange between Cox and Breiman (I call this the Cox-Breiman duel).

Ravi.
____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvarad...@jhmi.edu

----- Original Message -----
From: Bert Gunter <gunter.ber...@gene.com>
Date: Thursday, April 1, 2010 12:55 pm
Subject: Re: [R] sample size > 20K? Was: fitness of regression tree: how to measure???
To: 'Frank E Harrell Jr' <f.harr...@vanderbilt.edu>, 'vibha patel' <vibhapatel...@gmail.com>
Cc: r-help@r-project.org

> Since Frank has made this somewhat cryptic remark (sample size > 20K)
> several times now, perhaps I can add a few words of (what I hope is)
> further clarification.
>
> Despite any claims to the contrary, **all** statistical (i.e. empirical)
> modeling procedures are just data interpolators: that is, all that they
> can claim to do is produce reasonable predictions of what may be expected
> within the extent of the data. The quality of the model is judged by the
> goodness of fit/prediction over this extent. Ergo the standard textbook
> caveats about the dangers of extrapolation when using fitted models for
> prediction. Note, btw, the contrast to "mechanistic" models, which
> typically **are** assessed by how well they **extrapolate** beyond current
> data. For example, Newton's apple to the planets. They are often
> "validated" by their ability to "work" in circumstances (or scales) much
> different than those from which they were derived.
>
> So statistical models are just fancy "prediction engines." In particular,
> there is no guarantee that they provide any meaningful assessment of
> variable importance: how predictors causally relate to the response.
> Obviously, empirical modeling can often be useful for this purpose,
> especially in well-designed studies and experiments, but there's no
> guarantee: it's an "accidental" byproduct of effective prediction.
>
> This is particularly true for happenstance (un-designed) data and
> non-parametric models like regression/classification trees. Typically,
> there are many alternative models (trees) that give essentially the same
> quality of prediction. You can see this empirically by removing a modest
> random subset of the data and re-fitting. You should not be surprised to
> see the fitted model -- the tree topology -- change quite radically.
> HOWEVER, the predictions of the models within the extent of the data will
> be quite similar to the original results. Frank's point is that unless the
> data set is quite large and the predictive relationships quite strong --
> which usually implies parsimony -- this is exactly what one should expect.
> Thus it is critical not to over-interpret the particular model one gets,
> i.e. to infer causality from the model (tree) structure.
>
> Incidentally, there is nothing new or radical in this; indeed, John Tukey,
> Leo Breiman, George Box, and others wrote eloquently about this decades
> ago. And Breiman's random forest modeling procedure explicitly abandoned
> efforts to build simply interpretable models (from which one might infer
> causality) in favor of building better interpolators, although assessment
> of "variable importance" does try to recover some of that interpretability
> (however, no guarantees are given).
>
> HTH. And contrary views welcome, as always.
>
> Cheers to all,
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [ On Behalf Of Frank E Harrell Jr
> Sent: Thursday, April 01, 2010 5:02 AM
> To: vibha patel
> Cc: r-help@r-project.org
> Subject: Re: [R] fitness of regression tree: how to measure???
>
> vibha patel wrote:
> > Hello,
> >
> > I'm using rpart function for creating regression trees.
> > now how to measure the fitness of regression tree???
> >
> > thanks n Regards,
> > Vibha
>
> If the sample size is less than 20,000, assume that the tree is a
> somewhat arbitrary representation of the relationships in the data and
> that the form of the tree will not replicate in future datasets.
>
> Frank
>
> --
> Frank E Harrell Jr   Professor and Chairman        School of Medicine
>                      Department of Biostatistics   Vanderbilt University
>
> ______________________________________________
> R-help@r-project.org mailing list
>
> PLEASE do read the posting guide
> and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
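
For anyone who wants to try the experiment Bert describes in the quoted message (drop a modest random subset, refit, then compare both the tree topology and the predictions), here is a minimal sketch. It is in plain Python rather than R, and it uses a small hand-rolled regression tree as a stand-in for rpart, so it is not rpart's algorithm (no pruning, no surrogate splits). The simulated data, the seed, the depth of 3, the minimum leaf size of 10, and all function names are assumptions made purely for illustration.

```python
import random
import statistics

def best_split(rows, y, min_leaf=10):
    """Return (sse, feature, threshold) for the best binary split, or None."""
    best, n = None, len(rows)
    total, total_sq = sum(y), sum(v * v for v in y)
    for j in range(len(rows[0])):
        order = sorted(range(n), key=lambda i: rows[i][j])
        left_sum = left_sq = 0.0
        for k in range(1, n):
            i = order[k - 1]
            left_sum += y[i]
            left_sq += y[i] ** 2
            if k < min_leaf or n - k < min_leaf:
                continue  # keep at least min_leaf cases on each side
            if rows[order[k]][j] == rows[i][j]:
                continue  # cannot split between tied values
            # within-node sum of squares, left child plus right child
            sse = (left_sq - left_sum ** 2 / k) + (
                (total_sq - left_sq) - (total - left_sum) ** 2 / (n - k))
            if best is None or sse < best[0]:
                best = (sse, j, (rows[i][j] + rows[order[k]][j]) / 2)
    return best

def fit_tree(rows, y, depth=3):
    """Greedy least-squares tree: a leaf is a float, a node is a tuple."""
    split = best_split(rows, y) if depth > 0 else None
    if split is None:
        return sum(y) / len(y)  # leaf predicts the mean response
    _, j, thr = split
    lo = [i for i in range(len(rows)) if rows[i][j] <= thr]
    hi = [i for i in range(len(rows)) if rows[i][j] > thr]
    return (j, thr,
            fit_tree([rows[i] for i in lo], [y[i] for i in lo], depth - 1),
            fit_tree([rows[i] for i in hi], [y[i] for i in hi], depth - 1))

def predict(tree, x):
    while isinstance(tree, tuple):
        j, thr, lo, hi = tree
        tree = lo if x[j] <= thr else hi
    return tree

def topology(tree):
    """Split variables only (thresholds dropped), as a nested tuple."""
    if not isinstance(tree, tuple):
        return "leaf"
    return (tree[0], topology(tree[2]), topology(tree[3]))

# Simulated data (an assumption): x2 is a near-copy of x1, so many
# different trees can predict about equally well.
rng = random.Random(1)
n = 400
x1 = [rng.gauss(0, 1) for _ in range(n)]
x2 = [v + rng.gauss(0, 0.1) for v in x1]
x3 = [rng.gauss(0, 1) for _ in range(n)]
y = [a + 0.5 * c + rng.gauss(0, 0.5) for a, c in zip(x1, x3)]
rows = list(zip(x1, x2, x3))

full = fit_tree(rows, y)
keep = rng.sample(range(n), int(0.9 * n))  # drop a random 10% and refit
sub = fit_tree([rows[i] for i in keep], [y[i] for i in keep])

p_full = [predict(full, r) for r in rows]
p_sub = [predict(sub, r) for r in rows]
rmse_between = (sum((a - b) ** 2 for a, b in zip(p_full, p_sub)) / n) ** 0.5

print("topology, full data:", topology(full))
print("topology, 90% subset:", topology(sub))
print("RMSE between the two sets of predictions:", round(rmse_between, 3))
print("SD of y, for scale:", round(statistics.pstdev(y), 3))
```

Whether the two topologies differ depends on the seed and the data, so it is worth repeating the subset step several times. The quantity to watch is the RMSE between the two sets of predictions relative to the spread of y. In R, the analogous check would refit rpart on random subsets and compare the printed trees and the predicted values.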