Hi everyone,

 

Is there a way to take an lm() model and strip it down to a minimal form
(or convert it to another type of object) that can still be used to
predict the dependent variable?

 

Background:  I have a series of 6 lm() models, each of which is fit on
the same data frame of approximately 500,000 rows.  I eventually want to
predict all 6 of the dependent variables on a new data frame, which also
has several hundred thousand rows.

 

E.g. I need to create:

 

lm.1 <- lm(y1 ~ x1 + f + g, data=my.500k.rows)
lm.2 <- lm(y2 ~ x2 + f + g, data=my.500k.rows)
...
lm.6 <- lm(y6 ~ x6 + f + g, data=my.500k.rows)

 

and then predict y1 ... y6 for another large data set:

 

predict(lm.1, newdata=another.500k.rows)
predict(lm.2, newdata=another.500k.rows)
...
predict(lm.6, newdata=another.500k.rows)

 

Because of the size of the input data frame, the individual model
objects are quite large.  Through some probably ill-advised tinkering,
I've found that I can strip some of the larger components out of the
models (e.g. residuals, effects, fitted.values), and still be able to
use predict() to generate predicted values on new data.  
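
For instance, this is roughly what I have been doing (a sketch of the
tinkering only, using the placeholder names from above -- I'm not certain
which components are genuinely safe to drop):

lm.small <- lm.1
lm.small$residuals     <- NULL
lm.small$effects       <- NULL
lm.small$fitted.values <- NULL
object.size(lm.small)                               # noticeably smaller than lm.1
head(predict(lm.small, newdata=another.500k.rows))  # still appears to work for me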

 

However, the "qr" component can not be removed, and using qr=FALSE in
the call to lm() makes calling predict() on the resulting model object
return all zeroes.  The qr matrix seems to consume an amount of memory
proportional to the input data.  On my system (Windows XP, running R
2.7.1 with /3GB and -max-mem-size=2899M enabled), this means storing 6
such models simultaneously is impossible as I get "unable to allocate
vector of size <N>" errors during processing.

 

If I'm not mistaken, one could, in principle, generate the predicted
values of the dependent variables using just the coefficients (plus a
few other pieces of metadata, such as the model formula and the factor
levels), but I'm not sure whether there's a straightforward way to do so
in R.
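
For example, something along these lines ought to reproduce predict()
for one model (a rough sketch, assuming the fit is full rank so coef()
contains no NAs, and that the terms, factor levels and contrasts are
enough to rebuild the design matrix):

tt <- delete.response(terms(lm.1))                    # formula metadata, minus the response
mf <- model.frame(tt, another.500k.rows, xlev=lm.1$xlevels)
X  <- model.matrix(tt, mf, contrasts.arg=lm.1$contrasts)
manual.predictions <- drop(X %*% coef(lm.1))          # coefficients line up with the columns of X

But I'm not sure this covers everything predict() normally handles
(weights, offsets, NAs, rank deficiency, and so on).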

 

The example below shows what I am trying to do conceptually, with just
one model instead of six.

 

#########################

 

pop.size <- 500    # actual data size closer to 500,000

 

f.fac <- as.factor(c("A","B","C","D"))

g.fac <- as.factor(c("W","X","Y","Z"))

 

my.data <- data.frame(
     f  = f.fac[sample(1:4, pop.size, replace=TRUE)]
    ,g  = g.fac[sample(1:4, pop.size, replace=TRUE)]
    ,x1 = runif(pop.size, 0, 1)   # '=' rather than '<-', so the columns are actually named x1 and x2
    ,x2 = runif(pop.size, 0, 1)
    )

my.data$y1 <- my.data$x1 * rnorm(pop.size, -5, 5)
my.data$y2 <- my.data$x2 * rnorm(pop.size, -5, 5)

 

# Create the model.  (I tried using qr=FALSE, but it made prediction fail later on.)

lm.1 <- lm(y1~x1+f+g,data=my.data)

 

# Show the sizes of the different components of the model object

object.size(lm.1)

do.call("rbind",lapply(names(lm.1),function(x){list(name=x,size=object.s
ize(lm.1[x]))}))

 

# Create new data we want predictions for

my.predict <- data.frame(
     f  = f.fac[sample(1:4, pop.size, replace=TRUE)]
    ,g  = g.fac[sample(1:4, pop.size, replace=TRUE)]
    ,x1 = runif(pop.size, 0, 1)
    ,x2 = runif(pop.size, 0, 1)
    )

 

# Predict using standard R functionality.  This works, but because the
# model objects are so large, I can't hold all six of them in memory at once

 

predictions <- predict(lm.1, newdata=my.predict)

 

# Pretend we have a magic function that creates a minimally-sized model
# object from the coefficients -- one that can still be used to predict
# values, but that takes up far less memory than the standard lm object

 

lm.compact.1 <- compactify(lm.1)

 

# Goal: be able to generate the same predictions as with the standard
# methods, but with a more compact model object

 

predictions <- predict(lm.compact.1, newdata=my.predict)

 

#########################

 

One of my initial thoughts would be to somehow automatically create a
function from the model specification and coefficients that could then
be used to generate predicted values:

 

#  Internally, this would create a function that conceptually looks like:
#  function(x1, f, g) { coef.x1 * x1 + coef.f * f + coef.g * g }

 

lm.1.func <- model.to.function(coef(lm.1), formula(lm.1))

 

predictions <- lm.1.func(my.predict$x1, my.predict$f, my.predict$g)
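
Concretely, I imagine something like this sketch (the helper is
hypothetical; it leans on terms() and model.matrix() rather than parsing
the formula by hand, and again assumes a full-rank fit):

model.to.function <- function(fit) {
    cf  <- coef(fit)                 # keep only the small pieces of the fit
    tt  <- delete.response(terms(fit))
    xlv <- fit$xlevels
    ctr <- fit$contrasts
    rm(fit)                          # don't let the closure keep the full lm object alive
    function(newdata) {
        mf <- model.frame(tt, newdata, xlev=xlv)
        X  <- model.matrix(tt, mf, contrasts.arg=ctr)
        drop(X %*% cf)
    }
}

lm.1.func   <- model.to.function(lm.1)
predictions <- lm.1.func(my.predict)   # takes the whole new data frame rather than individual columns

I haven't tested whether this misses anything important that
predict.lm() normally takes care of, though.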

 

However, I'm open to any approach that allows predicted values to be
generated while consuming significantly less memory.  I've searched the
list archives and seen references to the "biglm" package, but that
appears to be intended for dealing with input data that is larger than
the system's memory can hold, rather than keeping the resulting model
object size to a minimum.

 

Thanks in advance for any guidance.

 

Keith 

 

