Dear Greg and Dobo,

The vif() function in the car package computes VIFs (and generalized VIFs)
from the covariance matrix of the coefficients. I'm not sure whether it will
work directly on objects produced by biglm(), but if not, it should be easy
to adapt.
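
As a rough sketch of what such an adaptation might look like (this is not the code in car; vif_from_vcov is a hypothetical name, and it assumes every term has a single coefficient, so it does not produce generalized VIFs): drop the intercept from the coefficient covariance matrix, convert the rest to a correlation matrix, and read the VIFs off the diagonal of its inverse. The check below uses lm() on simulated data so that it runs in base R.

```r
# Hypothetical helper (not part of car or biglm): VIFs from a coefficient
# covariance matrix, for models in which each term has one coefficient.
vif_from_vcov <- function(V, intercept = "(Intercept)") {
  keep <- setdiff(rownames(V), intercept)    # drop the intercept row/column
  if (length(keep) < 2) stop("need at least two non-intercept coefficients")
  R <- cov2cor(V[keep, keep, drop = FALSE])  # correlation matrix of the estimates
  structure(diag(solve(R)), names = keep)    # VIF_j is the j-th diagonal of R^{-1}
}

# Simulated data with x1 and x2 deliberately correlated.
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)
x3 <- rnorm(n)
y  <- 1 + x1 - x2 + 0.5 * x3 + rnorm(n)
fit <- lm(y ~ x1 + x2 + x3)

vifs <- vif_from_vcov(vcov(fit))

# Textbook definition for comparison: 1 / (1 - R^2) from regressing x1 on
# the other predictors.
r2_x1 <- summary(lm(x1 ~ x2 + x3))$r.squared
vif_x1_direct <- 1 / (1 - r2_x1)

print(vifs)
print(vif_x1_direct)
```

For a biglm fit the same call would be vif_from_vcov(vcov(fit)), on the assumption that vcov() returns a covariance matrix with coefficient names as row names for that object; I have not checked this. Only the covariance matrix is needed, so the model matrix never has to be held in memory.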

I hope this helps,
 John

------------------------------
John Fox, Professor
Department of Sociology
McMaster University
Hamilton, Ontario, Canada
web: socserv.mcmaster.ca/jfox

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Greg Snow
> Sent: February-19-09 11:35 AM
> To: dobomode; r-help@r-project.org
> Subject: Re: [R] Questions about biglm
>
> The idea of the biglm function is to only have part of the data in memory at
> a time. You read in part of the data and run biglm on that section of the
> data, then delete it from memory, load in the next part of the data and use
> update to include the new data in the analysis, delete that, read in the next
> group, run update, and repeat until you have processed all the data. The
> result will then be the same as if you ran lm on the entire dataset (apart
> from possible slight differences due to rounding). The bigglm function or
> code from other packages (SQLiteDF for one) can automate this a bit more.
>
> The code for VIF below uses the model.matrix command, which returns the x
> matrix for the analysis when used with an lm object. Since biglm is based on
> the idea of not having all the data in memory at once, I would be very
> surprised if model.matrix worked with biglm objects, so that code is unlikely
> to work as is.
>
> One approach is to do VIF and other diagnostics on a subset of the data
> (random sample, stratified random sample) that fits easily into memory, then
> after making decisions about the model based on the diagnostics, run the
> final model with biglm to get the precise results using the full data set.
> You can do the diagnostics on a couple of different random subsets to confirm
> the decisions made.
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.s...@imail.org
> 801.408.8111
>
>
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of dobomode
> > Sent: Wednesday, February 18, 2009 9:34 PM
> > To: r-help@r-project.org
> > Subject: [R] Questions about biglm
> >
> > Hello folks,
> >
> > I am very excited to have discovered R and have been exploring its
> > capabilities. R's regression models are of great interest to me as my
> > company is in the business of running thousands of linear regressions
> > on large datasets.
> >
> > I am using biglm to run linear regressions on datasets that are as
> > large as several GB. I have been pleasantly surprised that biglm
> > runs the regressions extremely fast (one regression may take minutes
> > in SPSS vs seconds in R).
> >
> > I have been trying to wrap my head around biglm and have a couple of
> > questions.
> >
> > 1. How can I get VIFs (Variance Inflation Factors) using biglm? I was
> > able to get VIFs from the regular lm function using this piece of
> > code I found through Google, but have not been able to adapt it to
> > work with biglm. Has anyone been successful in this?
> >
> > vif.lm <- function(object, ...) {
> >   V <- summary(object)$cov.unscaled
> >   Vi <- crossprod(model.matrix(object))
> >   nam <- names(coef(object))
> >   if (k <- match("(Intercept)", nam, nomatch = 0)) {
> >     v1 <- diag(V)[-k]
> >     v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
> >     nam <- nam[-k]
> >   } else {
> >     v1 <- diag(V)
> >     v2 <- diag(Vi)
> >     warning("No intercept term detected. Results may surprise.")
> >   }
> >   structure(v1 * v2, names = nam)
> > }
> >
> > 2. How reliable / stable is biglm's update() function? I was
> > experimenting with running regressions on individual chunks of my
> > large dataset, but the coefficients I got were different compared to
> > those obtained from running biglm on the whole dataset.
> > Am I mistaken when I say that update() is intended to run regressions
> > in chunks (when memory becomes an issue with datasets that are too
> > large) and produce identical results to running a single regression on
> > the dataset as a whole?
> >
> > Thanks!
> >
> > Dobo
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.