On Sat, 21 Feb 2009, Charles C. Berry wrote:

On Sat, 21 Feb 2009, Tal Galili wrote:

Hello dear R mailing list members.

I have recently become curious about the possibility of applying model
selection algorithms (even ones as simple as AIC-based selection) to
regressions on large datasets.


Large in the sense of many observations, one assumes.

But how large in terms of the number of variables??

If there are not too many variables, you can form the regression sums of squares for all 2^p combinations of regressors from a biglm() fit of the full model, since biglm provides coef() and vcov() methods.
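For concreteness, here is a minimal sketch of the coef()/vcov() trick (my own illustration, not code from the thread; I use lm() on simulated data, but biglm() supplies the same coef() and vcov() methods). For a linear model, dropping a set S of regressors and refitting the rest raises the residual sum of squares by b_S' [(X'X)^{-1}_{SS}]^{-1} b_S, and since vcov() returns s^2 (X'X)^{-1}, every subset's RSS can be recovered from the full fit without a second pass over the data:

```r
## Illustrative sketch -- variable names are mine, not from the thread.
set.seed(1)
n <- 200
d <- data.frame(y = rnorm(n), x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))

fit <- lm(y ~ x1 + x2 + x3, data = d)  # biglm() exposes the same methods
b   <- coef(fit)
V   <- vcov(fit)                        # V = s^2 * (X'X)^{-1}
rss_full <- sum(resid(fit)^2)
s2  <- rss_full / fit$df.residual

## RSS of the submodel that drops x2 and x3, computed from the full
## fit alone via the partitioned-regression identity:
S <- c("x2", "x3")
rss_sub <- rss_full + s2 * drop(t(b[S]) %*% solve(V[S, S]) %*% b[S])

## Agrees with a direct refit of the submodel:
all.equal(rss_sub, sum(resid(lm(y ~ x1, data = d))^2))
```

Looping this over all 2^p subsets gives each candidate model's RSS, and hence its AIC (up to constants), from a single fit of the full model.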

If p is large, you will most likely need to subsample to reduce the number of variables to 'not too many', fit via lm() and friends, and then apply the above strategy.


If you can fit the complete p-variable model (so you have more observations 
than variables), the search algorithms don't require the raw data, so the 
search time depends on p but not on n.  That's how the leaps package works, for 
example.  This is only for lm(), but you get a pretty good approximation for 
glm() by doing the search using the weighted linear model from the last 
iteration of IWLS, finding a reasonably large collection of best models, and 
then refitting them in glm() to see which is really best.
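A sketch of that IWLS step (my own illustration on simulated logistic data, not code from the thread). The working response z and working weights w from the converged glm define a weighted linear model whose WLS solution reproduces the glm coefficients; that weighted model is what you would hand to a best-subsets search such as leaps::regsubsets():

```r
set.seed(2)
n <- 500
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
d$y <- rbinom(n, 1, plogis(1 + d$x1 - d$x2))

full <- glm(y ~ ., data = d, family = binomial)

## Working response and weights from the last IWLS iteration:
eta <- predict(full, type = "link")
mu  <- fitted(full)
z   <- eta + (d$y - mu) / full$family$mu.eta(eta)
w   <- full$weights

## The weighted linear model reproduces the converged glm fit ...
approx_fit <- lm(z ~ x1 + x2 + x3 + x4, data = d, weights = w)
all.equal(coef(approx_fit), coef(full), tolerance = 1e-4)

## ... so the subset search can run on it instead of the glm, e.g.
##   leaps::regsubsets(z ~ x1 + x2 + x3 + x4, data = d,
##                     weights = w, nbest = 5)
## and the top candidates are then refit with glm() to pick the real best.
```

The payoff is that the expensive combinatorial search touches only the linear approximation; glm() is called only a handful of times at the end.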

Of course, none of this solves the problem that AIC isn't correctly calibrated 
for searching large model spaces.


      -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.