[R] VIF's in R using BIGLM
Dear R-help,

This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c

I am developing an open-source automated system for running batch regressions on very large datasets. In my previous post, I asked how to obtain VIFs from the output of BIGLM. With a lot of help from Thomas Lumley, Associate Professor of Biostatistics at the University of Washington, I was able to make significant progress, but ultimately got stuck. This post describes the steps and reasoning I followed in trying to accomplish the task. Please note that I am not a statistician, so please disregard any commentary that seems naive to you.

A quick intro. The goal is to obtain VIFs (variance inflation factors) from the regression output of BIGLM. This has traditionally been possible with the regular lm() function. A quick illustration follows (the model below is pretty silly and is for illustration purposes only).
Example dataset:

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Example model:

model <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

Regression:

reg_lm <- lm(model, mtcars)

Results:

> summary(reg_lm)

Call:
lm(formula = model, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07

VIFs:

> vif(reg_lm)
      cyl      disp        hp      drat        wt      qsec        vs        am      gear      carb
15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873  4.648487  5.357452  7.908747

Here is the definition of vif() (courtesy of http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):

> vif.lm
function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object)
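For reference, the VIF values printed above can be reproduced from the textbook definition, VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared obtained by regressing predictor j on all the other predictors. The following is a minimal base-R sketch under that definition. The names predictors, vif_by_regression and vif_by_correlation are made up for illustration and are not part of lm() or biglm.

```r
# Predictors of the example model (the response, mpg, is excluded)
predictors <- mtcars[, c("cyl", "disp", "hp", "drat", "wt",
                         "qsec", "vs", "am", "gear", "carb")]

# VIF_j = 1 / (1 - R_j^2): regress each predictor on all the others
vif_by_regression <- sapply(names(predictors), function(v) {
  others <- setdiff(names(predictors), v)
  aux_fit <- lm(reformulate(others, response = v), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

# Equivalent shortcut: the diagonal of the inverse of the
# correlation matrix of the predictors
vif_by_correlation <- diag(solve(cor(predictors)))

print(round(vif_by_regression, 6))
print(round(vif_by_correlation, 6))
```

Both vectors should agree with the vif(reg_lm) output shown above. The second form only needs the correlation matrix of the predictors, not the individual rows of data.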
[R] Running out of memory when importing SPSS files
Hello R-help,

I am trying to import a large dataset from SPSS into R. The SPSS file is in .SAV format and is about 1GB in size. I use read.spss to import the file and get an error saying that I have run out of memory. I am on a Mac OS X 10.5 system with 4GB of RAM. Monitoring the R process tells me that R runs out of memory when it reaches about 3GB of RAM, so I suppose the remaining 1GB is used up by the OS.

Why would a 1GB SPSS file take up more than 3GB of memory in R? Is it perhaps because R is converting each SPSS column to a less memory-efficient data type? In general, what is the best strategy for loading large datasets in R?

Thanks!

P.S.

I exported the SPSS .SAV file to .CSV and tried importing the comma-delimited file. Same results: the import was much slower, but eventually I ran out of memory again...

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
[R] Questions about biglm
Hello folks,

I am very excited to have discovered R and have been exploring its capabilities. R's regression models are of great interest to me, as my company is in the business of running thousands of linear regressions on large datasets.

I am using biglm to run linear regressions on datasets as large as several GB. I have been pleasantly surprised that biglm runs the regressions extremely fast (one regression may take minutes in SPSS vs seconds in R).

I have been trying to wrap my head around biglm and have a couple of questions.

1. How can I get VIFs (Variance Inflation Factors) using biglm? I was able to get VIFs from the regular lm function using this piece of code I found through Google, but have not been able to adapt it to work with biglm. Has anyone been successful in this?

    vif.lm <- function(object, ...) {
      # Unscaled covariance of the coefficients, (X'X)^-1
      V <- summary(object)$cov.unscaled
      # Cross-product of the model matrix, X'X
      Vi <- crossprod(model.matrix(object))
      nam <- names(coef(object))
      # match() returns 0 (treated as FALSE) when there is no intercept
      if (k <- match("(Intercept)", nam, nomatch = 0)) {
        v1 <- diag(V)[-k]
        # Centered sum of squares of each predictor
        v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
        nam <- nam[-k]
      } else {
        v1 <- diag(V)
        v2 <- diag(Vi)
        warning("No intercept term detected. Results may surprise.")
      }
      structure(v1 * v2, names = nam)
    }

2. How reliable / stable is biglm's update() function? I was experimenting with running regressions on individual chunks of my large dataset, but the coefficients I got were different from those obtained from running biglm on the whole dataset. Am I mistaken in thinking that update() is intended to run regressions in chunks (when memory becomes an issue with datasets that are too large) and to produce results identical to running a single regression on the dataset as a whole?

Thanks!

Dobo
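Both questions come down to what a chunked fit has to carry from one chunk to the next. One way to see this is to accumulate the cross-products X'X and X'y chunk by chunk. The sketch below does so in base R on mtcars, split into four arbitrary chunks. It is a plain normal-equations illustration, not biglm's algorithm (biglm is documented as updating a QR decomposition incrementally), and the names form, chunks, XtX, Xty, beta_chunked and vif_chunked are hypothetical. The VIF step reuses the logic of the vif.lm function in question 1, which needs only X'X and its inverse.

```r
# Small model on mtcars, processed in four chunks of eight rows
form <- mpg ~ wt + hp + disp
chunks <- split(mtcars, rep(1:4, each = 8))

# Accumulate X'X and X'y one chunk at a time; only these small
# matrices are kept between chunks, never the full data
XtX <- 0
Xty <- 0
for (chunk in chunks) {
  X <- model.matrix(form, chunk)
  y <- chunk$mpg
  XtX <- XtX + crossprod(X)
  Xty <- Xty + crossprod(X, y)
}

# Coefficients from the accumulated normal equations
beta_chunked <- drop(solve(XtX, Xty))

# The same model fitted on the whole dataset in one go
fit_full <- lm(form, data = mtcars)

# VIFs from the accumulated X'X alone:
# diag((X'X)^-1) times each predictor's centered sum of squares
k <- match("(Intercept)", colnames(XtX))
v1 <- diag(solve(XtX))[-k]
v2 <- diag(XtX)[-k] - XtX[k, -k]^2 / XtX[k, k]
vif_chunked <- v1 * v2

print(rbind(chunked = beta_chunked, full = coef(fit_full)))
print(vif_chunked)
```

The two coefficient rows should match up to rounding error. By contrast, fitting a separate regression to each chunk produces a different set of coefficients per chunk, and those are not expected to equal the whole-data fit.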
Re: [R] Running out of memory when importing SPSS files
I found the culprit. I had a number of variables in the SPSS file that were a variable-length string data type (255 characters). This seemed to force R into creating 255-byte variables, which eventually choked my machine's memory...

On Feb 18, 5:34 pm, Uwe Ligges wrote:
> dobomode wrote:
> > Hello R-help,
> >
> > I am trying to import a large dataset from SPSS into R. The SPSS file
> > is in .SAV format and is about 1GB in size. I use read.spss to import
> > the file and get an error saying that I have run out of memory. I am
> > on a MAC OS X 10.5 system with 4GB of RAM. Monitoring the R process
> > tells me that R runs out of memory when reaching about 3GB of RAM so I
> > suppose the remaining 1GB is used up by the OS.
> >
> > Why would a 1GB SPSS file take up more than 3GB of memory in R?
>
> Because SPSS stores data in a compressed way?
>
> > Is it perhaps because R is converting each SPSS column to a less
> > memory-efficient data type? In general, what is the best strategy to
> > load large datasets in R?
>
> Use a 64-bit version of R and have sufficient amount of RAM in your system.
>
> Uwe Ligges
>
> > Thanks!
> >
> > P.S.
> >
> > I exported the SPSS .SAV file to .CSV and tried importing the comma
> > delimited file. Same results: the import was much slower, but
> > eventually I ran out of memory again...
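A quick way to check which columns dominate memory after an import is to call object.size() on each column. The sketch below uses a synthetic data frame (all names and sizes are made up for illustration, and no SPSS file is read) with a string column padded to 255 characters, mimicking fixed-width strings, and compares it with the same column after the trailing blanks are trimmed.

```r
n <- 10000

# Synthetic stand-in for an imported file: two numeric columns and one
# string column padded with trailing blanks to a fixed width of 255
dat <- data.frame(
  id    = seq_len(n),
  value = rnorm(n),
  label = sprintf("%-255s", paste0("row", seq_len(n))),
  stringsAsFactors = FALSE
)

# Memory used by each column, in bytes
bytes_per_column <- sapply(dat, function(col) as.numeric(object.size(col)))

# The same string column with the trailing padding removed
trimmed_label <- sub(" +$", "", dat$label)
bytes_trimmed <- as.numeric(object.size(trimmed_label))

print(round(bytes_per_column / 1024^2, 2))  # MB per column
print(round(bytes_trimmed / 1024^2, 2))     # MB for the trimmed strings
```

In this toy case the padded string column is far larger than either numeric column, and trimming the padding shrinks it considerably.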
[R] Using JRI and Java 1.6 on MAC OS X
Dear R-Help,

I am trying to get JRI (the rJava interface allowing Java to connect to R) to work. I was able to run it a week ago when I was doing some testing using Java 1.5. However, I am developing a GUI application using some of the new Java 1.6 features and I just can't get JRI to work with this setup. Here is what I get:

Cannot find JRI native library!
Please make sure that the JRI native library is in a directory listed in java.library.path.

java.lang.UnsatisfiedLinkError: /Library/Frameworks/R.framework/Versions/2.8/Resources/library/rJava/jri/libjri.jnilib:
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1822)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1739)
        at java.lang.Runtime.loadLibrary0(Runtime.java:823)
        at java.lang.System.loadLibrary(System.java:1030)
        at org.rosuda.JRI.Rengine.<clinit>(Rengine.java:9)
        at mfa.mes.gui.MESFrame.initR(MESFrame.java:79)
        at mfa.mes.gui.MESFrame.<init>(MESFrame.java:313)
        at mfa.mes.MES.main(MES.java:131)
Java Result: 1

Notice that it did actually find the JRI.jar library. The error seems to be related to the native JNI library. I have set my Java library path and R_HOME correctly.

I saw this entry in the changelog for rJava:

0.4-10 2006-09-14
 o Removed obsolete JNI 1.1 support that is no longer provided in JDK 1.6 and thus prevented rJava from being used with JDK 1.6

I am curious whether this change has been applied to JRI as well. It would be very unfortunate if JRI were incompatible with the latest JDK.

I am running NetBeans / JDK 1.6 on Mac OS X 10.5.6.

Any help would be greatly appreciated!

Dobo
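When double-checking this kind of setup, it can help to ask R itself where it expects these pieces to live. The sketch below prints the R home directory, the location of the jri directory inside an installed rJava package (an empty string if rJava is not installed), and the word size of the running R build. A native library such as libjri.jnilib can only be loaded by a JVM of the same architecture, so a 32-bit vs 64-bit mismatch is one thing worth ruling out. Whether that is the cause here is an assumption, not something the error output confirms.

```r
# Where this R installation lives; R_HOME should point here
r_home <- R.home()

# Directory holding the JRI native library and JRI.jar, if rJava is installed
# (system.file() returns "" when the package or directory is missing)
jri_dir <- system.file("jri", package = "rJava")

# Word size of the running R build (32 or 64)
bits <- .Machine$sizeof.pointer * 8

cat("R_HOME:           ", r_home, "\n")
cat("JRI directory:    ",
    if (nzchar(jri_dir)) jri_dir else "(rJava/JRI not found)", "\n")
cat("R build word size:", bits, "bit\n")
```

The JRI directory printed here is the path to pass to the JVM via -Djava.library.path, and the R_HOME environment variable seen by the Java process should match the first line.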