Re: [R] R's Data Dredging Philosophy for Distribution Fitting

Frank E Harrell Jr Wed, 14 Jul 2010 18:34:27 -0700

On 07/14/2010 06:22 PM, emorway wrote:


Forum,

I'm a grad student in Civil Eng, took some Stats classes that required
students learn R, and I have since taken to R and use it for as much as I
can.  Back in my lab/office, many of my fellow grad students still use
proprietary software at the behest of advisers who are familiar with the
recommended software (Statistica, @Risk (Excel Add-on), etc).  I have spent
a lot of time learning R and am confident it can generally out-process,
out-graph, or more simply stated, out-perform most of these other software
packages.  However, one area my view has been humbled in is distribution
fitting.

I started by reading through
http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf  After that
I started digging around on this forum and found posts like this one
http://r.789695.n4.nabble.com/Fitting-usual-distributions-td800000.html#a800000
that are close to what I'm after.  That is, given an observation dataset, I
would like to call a function that cycles through numerous distributions
(common or not) and then ranks them for me based on Chi-Square,
Kolmogorov-Smirnov and/or Anderson-Darling, for example.

This question was asked back in 2004:
http://finzi.psych.upenn.edu/R/Rhelp02a/archive/37053.html but the response
was that this kind of thing wasn't in R nor in proprietary software to the
best of the responding author's memory.  In 2010, however, this is no longer
true as @Risk's
(http://www.palisade.com/risk/?gclid=CKvblPSM7KICFZQz5wodDRI2fg)
"Distribution Fitting" function does this very thing.  And it is here that
my R pride has taken a hit.  Based on the first response to the question
posed here
http://r.789695.n4.nabble.com/Which-distribution-best-fits-the-data-td859448.html#a859448
is it fair to say that the R community (I realize this is only 1 view) would
take exception to this kind of "data mining"?

Unless I've missed a discussion of a package that does this very thing, it
seems as though I would need to code something up using fitdistr() and do
all the ranking myself.  Undoubtedly that would be a good exercise for me,
but it's hard for me to believe R would be a runner-up to something like
distribution fitting in @Risk.
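
For illustration, the "code something up using fitdistr()" idea might look roughly like the sketch below. It is only a sketch under stated assumptions: the data are simulated, the candidate list and the helper name rank_fits() are made up for this example, and ranking by AIC and by the Kolmogorov-Smirnov distance is just one possible choice. The KS statistic is used here purely as a distance; its usual p-value is not valid when the parameters are estimated from the same data.

```r
library(MASS)  # provides fitdistr(); MASS ships with standard R installations

set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)  # made-up positive data (assumption)

# Hypothetical candidate list: fitdistr() name -> name of the matching CDF
candidates <- c(normal = "pnorm", lognormal = "plnorm",
                gamma = "pgamma", weibull = "pweibull",
                exponential = "pexp")

# Hypothetical helper: fit each candidate by maximum likelihood, then rank
rank_fits <- function(x, candidates) {
  rows <- lapply(names(candidates), function(dist) {
    fit <- suppressWarnings(fitdistr(x, dist))
    # KS distance between the data and the fitted CDF; the fitted
    # parameter names (e.g. shape, rate) match the CDF's arguments
    ks <- suppressWarnings(
      do.call(ks.test, c(list(x, candidates[[dist]]), as.list(fit$estimate)))
    )
    data.frame(dist = dist, AIC = AIC(fit), KS = unname(ks$statistic))
  })
  out <- do.call(rbind, rows)
  out[order(out$AIC), ]  # smallest AIC first
}

print(rank_fits(x, candidates))
```

Note that all five candidates here need x > 0 except the normal, so a real version would have to filter the candidate list by the support of the data.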

Eric


Eric,

I didn't read the links you provided but the approach you have advocated
(and you are not alone) is futile.  If you entertain more than about 2
distributions, the variance of the final fits is no better than the
variance of the empirical cumulative distribution function (once you
properly adjust variances for model uncertainty).  So just go empirical.
In general if your touchstone is the observed data (as in checking
goodness of fit of various parametric distributions), your final
estimators will have the variance of empirical estimators.
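
As a rough illustration of what "going empirical" can look like in base R, here is a minimal sketch. The data are simulated and the cut-off 5, the quantile levels, and the number of bootstrap replicates are arbitrary assumptions for the example, not recommendations.

```r
set.seed(1)
x <- rgamma(200, shape = 2, rate = 0.5)  # made-up data (assumption)

Fn <- ecdf(x)   # empirical CDF, returned as a step function
Fn(5)           # empirical estimate of P(X <= 5)

# Empirical quantiles, no parametric family involved
quantile(x, c(0.05, 0.5, 0.95))

# Bootstrap standard error of the empirical 95th percentile
boot_q <- replicate(2000, quantile(sample(x, replace = TRUE), 0.95))
sd(boot_q)
```

plot(Fn) draws the step function if a picture is wanted.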


Frank
--
Frank E Harrell Jr   Professor and Chairman        School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
