[R] VIF's in R using BIGLM

2009-04-27 Thread dobomode
Dear R-help

This is a follow-up to my previous post here:
http://groups.google.com/group/r-help-archive/browse_thread/thread/d9b6f87ce06a9fb7/e9be30a4688f239c?lnk=gst&q=dobomode#e9be30a4688f239c

I am developing an open-source automated system for running batch
regressions on very large datasets. In my previous post, I asked how
to obtain VIF's from the output of BIGLM. With a lot of help from
Thomas Lumley, Associate Professor of Biostatistics at the University
of Washington, I was able to make significant progress, but ultimately
got stuck. The following post describes the steps and reasoning I
undertook in trying to accomplish this task. Please note that I am not
a statistician, so ignore any commentary that seems naive to you.

A quick intro. The goal is to obtain VIF's (variance inflation
factors) from the regression output of BIGLM. Traditionally, this has
been possible with the regular lm() function. A quick illustration
follows (the model below is pretty silly and is for illustration
purposes only).

Example dataset:

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Example model:

model <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb

Regression:

reg_lm <- lm(model, mtcars)

Results:

> summary(reg_lm)

Call:
lm(formula = model, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.65 on 21 degrees of freedom
Multiple R-squared: 0.869,  Adjusted R-squared: 0.8066
F-statistic: 13.93 on 10 and 21 DF,  p-value: 3.793e-07

VIF's:

> vif(reg_lm)
      cyl      disp        hp      drat        wt      qsec        vs
15.373833 21.620241  9.832037  3.374620 15.164887  7.527958  4.965873
       am      gear      carb
 4.648487  5.357452  7.908747
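
For reference, these numbers can be cross-checked in base R without any vif() helper. The sketch below is only an illustration, not how vif() itself is implemented: it uses the textbook definition VIF_j = 1 / (1 - R_j^2), the equivalent diagonal of the inverse correlation matrix of the predictors, and then rebuilds the same correlation matrix from sums accumulated over two chunks. The two-chunk split of mtcars and all variable names are assumptions made for the example.

```r
# Predictors of the example model (every column of mtcars except mpg)
X <- as.matrix(mtcars[, -1])

# 1. Textbook definition: regress one predictor on all the others
r2_cyl  <- summary(lm(cyl ~ . - mpg, data = mtcars))$r.squared
vif_cyl <- 1 / (1 - r2_cyl)

# 2. All VIF's at once: diagonal of the inverse correlation matrix
vif_all <- diag(solve(cor(X)))

# 3. Same quantity from sums accumulated chunk by chunk
chunks <- list(X[1:16, ], X[17:32, ])   # stand-ins for pieces of a big file
n   <- 0
s   <- numeric(ncol(X))
xtx <- matrix(0, ncol(X), ncol(X))
for (ch in chunks) {
  n   <- n + nrow(ch)          # running row count
  s   <- s + colSums(ch)       # running column sums
  xtx <- xtx + crossprod(ch)   # running X'X
}
S <- xtx - tcrossprod(s) / n   # centered cross-products
vif_chunked <- diag(solve(cov2cor(S)))

print(vif_cyl)
print(round(vif_all, 6))
print(round(vif_chunked, 6))
```

The third variant only ever holds one chunk plus a small p-by-p matrix in memory, which is why the VIF's do not in principle require the whole dataset at once.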

Here is the definition of vif() (courtesy of
http://www.stat.sc.edu/~hitchcock/bodyfatRexample.txt):

> vif.lm
function(object, ...) {
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
nam <- names(coef(object)

[R] Running out of memory when importing SPSS files

2009-02-18 Thread dobomode
Hello R-help,

I am trying to import a large dataset from SPSS into R. The SPSS file
is in .SAV format and is about 1GB in size. I use read.spss to import
the file and get an error saying that I have run out of memory. I am
on a Mac OS X 10.5 system with 4GB of RAM. Monitoring the R process
tells me that R runs out of memory when reaching about 3GB of RAM, so I
suppose the remaining 1GB is used up by the OS.

Why would a 1GB SPSS file take up more than 3GB of memory in R? Is it
perhaps because R is converting each SPSS column to a less memory-
efficient data type? In general, what is the best strategy to load
large datasets in R?

Thanks!

P.S.

I exported the SPSS .SAV file to .CSV and tried importing the comma
delimited file. Same results – the import was much slower but
eventually I ran out of memory again...

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Questions about biglm

2009-02-18 Thread dobomode
Hello folks,

I am very excited to have discovered R and have been exploring its
capabilities. R's regression models are of great interest to me as my
company is in the business of running thousands of linear regressions
on large datasets.

I am using biglm to run linear regressions on datasets that are as
large as several GB. I have been pleasantly surprised that biglm
runs the regressions extremely fast (one regression may take minutes
in SPSS vs seconds in R).

I have been trying to wrap my head around biglm and have a couple of
questions.

1. How can I get VIF's (Variance Inflation Factors) using biglm? I was
able to get VIF's from the regular lm function using this piece of
code I found through Google, but have not been able to adapt it to
work with biglm. Has anyone been successful in this?

vif.lm <- function(object, ...) {
  # V = (X'X)^-1 and Vi = X'X, where X is the model matrix
  V <- summary(object)$cov.unscaled
  Vi <- crossprod(model.matrix(object))
  nam <- names(coef(object))
  k <- match("(Intercept)", nam, nomatch = 0L)
  if (k > 0) {
    # Drop the intercept; v2 is each predictor's centered sum of squares
    v1 <- diag(V)[-k]
    v2 <- diag(Vi)[-k] - Vi[k, -k]^2 / Vi[k, k]
    nam <- nam[-k]
  } else {
    v1 <- diag(V)
    v2 <- diag(Vi)
    warning("No intercept term detected. Results may surprise.")
  }
  structure(v1 * v2, names = nam)
}

2. How reliable / stable is biglm's update() function? I was
experimenting with running regressions on individual chunks of my
large dataset, but the coefficients I got were different from those
obtained from running biglm on the whole dataset. Am I mistaken in
thinking that update() is intended to run regressions in chunks
(when memory becomes an issue with datasets that are too large) and
to produce results identical to those from running a single
regression on the dataset as a whole?
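
To make question 2 concrete, here is a minimal sketch of the chunked workflow as I understand it from the biglm documentation: biglm() fits the first chunk and update() folds each further chunk into the same fit. It assumes the biglm package is installed; the smaller formula, the two-chunk split of mtcars and the object names are made up for the illustration.

```r
library(biglm)  # assumes the biglm package is installed

model <- mpg ~ cyl + disp + hp + wt

# Two chunks standing in for pieces of a dataset too large for memory
chunk1 <- mtcars[1:16, ]
chunk2 <- mtcars[17:32, ]

fit_chunked <- biglm(model, data = chunk1)   # start from the first chunk
fit_chunked <- update(fit_chunked, chunk2)   # fold in the second chunk

fit_full <- lm(model, data = mtcars)         # all rows at once

# Should agree up to rounding error
print(all.equal(coef(fit_chunked), coef(fit_full)))

# By contrast, fitting a chunk on its own is a different regression
print(coef(biglm(model, data = chunk2)))
```

If the coefficients differ, one thing worth checking is whether each chunk was fitted with a fresh biglm() call instead of being passed to update() on the existing fit.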

Thanks!

Dobo



Re: [R] Running out of memory when importing SPSS files

2009-02-18 Thread dobomode
I found the culprit. I had a number of variables in the SPSS file that
were of a variable-length string data type (255 characters). This
seemed to force R into creating 255-byte variables, which eventually
choked my machine's memory...
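
A rough way to see the effect in isolation, without an SPSS file: the sketch below compares the memory taken by a numeric column with that of a character column whose values are padded to 255 characters. It assumes the imported strings really were padded to their full declared width, which is only my reading of what happened; the column names and the row count are invented for the example.

```r
n <- 10000

# A numeric column: 8 bytes per value
num_col <- as.numeric(seq_len(n))

# A string column padded with blanks to 255 characters per value
str_col <- format(as.character(seq_len(n)), width = 255)

print(object.size(num_col), units = "Kb")
print(object.size(str_col), units = "Kb")

# Trimming the trailing blanks recovers most of the space
trimmed <- sub(" +$", "", str_col)
print(object.size(trimmed), units = "Kb")
```

A handful of such columns can dominate the memory footprint, so dropping or trimming them before the import is worth trying.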


On Feb 18, 5:34 pm, Uwe Ligges wrote:
> dobomode wrote:
> > Hello R-help,
>
> > I am trying to import a large dataset from SPSS into R. The SPSS file
> > is in .SAV format and is about 1GB in size. I use read.spss to import
> > the file and get an error saying that I have run out of memory. I am
> > on a MAC OS X 10.5 system with 4GB of RAM. Monitoring the R process
> > tells me that R runs out of memory when reaching about 3GB of RAM so I
> > suppose the remaining 1GB is used up by the OS.
>
> > Why would a 1GB SPSS file take up more than 3GB of memory in R?
>
> Because SPSS stores data in a compressed way?
>
> > Is it perhaps because R is converting each SPSS column to a less
> > memory-efficient data type? In general, what is the best strategy to
> > load large datasets in R?
>
> Use a 64-bit version of R and have sufficient amount of RAM in your system.
>
> Uwe Ligges
>
> > Thanks!
>
> > P.S.
>
> > I exported the SPSS .SAV file to .CSV and tried importing the comma
> > delimited file. Same results – the import was much slower but
> > eventually I ran out of memory again...


[R] Using JRI and Java 1.6 on MAC OS X

2009-02-28 Thread dobomode
Dear R-Help,

I am trying to get JRI (the rJava interface allowing Java to connect
to R) to work. I was able to run it a week ago when I was doing some
testing using Java 1.5. However, I am developing a GUI application
using some of the new Java 1.6 features and I just can't get JRI to
work with this setup.

Here is what I get:

Cannot find JRI native library!
Please make sure that the JRI native library is in a directory listed
in java.library.path.

java.lang.UnsatisfiedLinkError: /Library/Frameworks/R.framework/Versions/2.8/Resources/library/rJava/jri/libjri.jnilib:
        at java.lang.ClassLoader$NativeLibrary.load(Native Method)
        at java.lang.ClassLoader.loadLibrary0(ClassLoader.java:1822)
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1739)
        at java.lang.Runtime.loadLibrary0(Runtime.java:823)
        at java.lang.System.loadLibrary(System.java:1030)
        at org.rosuda.JRI.Rengine.<clinit>(Rengine.java:9)
        at mfa.mes.gui.MESFrame.initR(MESFrame.java:79)
        at mfa.mes.gui.MESFrame.<init>(MESFrame.java:313)
        at mfa.mes.MES.main(MES.java:131)
Java Result: 1

Notice that it did actually find the JRI.jar library. The error seems
to be related to the native JNI. I have set my java library path and
R_HOME correctly.

I saw this entry in the changelog for rJava:

0.4-10  2006-09-14
    o   Removed obsolete JNI 1.1 support that is no longer provided
        in JDK 1.6 and thus prevented rJava from being used with JDK
        1.6

I am curious if this change has been applied to JRI as well. It would
be very unfortunate if JRI is incompatible with the latest JDK.

I am running NetBeans / JDK 1.6 on Mac OS X 10.5.6.

Any help would be greatly appreciated!

Dobo
