Given a large data.frame, a function trains a series of models by looping over 
two steps:

1. Create a model-specific subset of the complete training data
2. Train a model on the subset data

The function returns a list of trained models which are later used for 
prediction on test data.

Due to how models and closures work in R, each model contains a lot of data 
that is not necessary for prediction. The space requirements of all models 
combined can become prohibitive if the number of samples in the training data 
and the number of models are sufficiently large:

1. A trained linear model (and other models that follow the same conventions) 
contains the training data itself and related quantities (such as residuals). 
While this is convenient for some kinds of analysis, it negates the 
space-saving effect of compacting the training data into the model parameters.

2. Any function created in the loop contains the training data in its enclosing 
environment. For example, a linearising transform defined as 

linearise <- function(x) {
    x^gamma
}

(where `gamma` is derived from the training data) carries not only `gamma` 
but also every other object in its enclosing environment (e.g. intermediate 
computations in the loop). If `linearise` is returned with the model, those 
objects are also returned implicitly.
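
To illustrate the second point, here is a minimal sketch. The names 
`train_leaky` and `big` and the value of `gamma` are made up for this example; 
`big` stands in for an intermediate computation in the loop.

```r
# Hypothetical training step: returns only the transform, nothing else.
train_leaky <- function(data) {
    big <- rnorm(1e5)        # stand-in for an intermediate computation
    gamma <- 0.5             # assumed value; really derived from `data`
    linearise <- function(x) {
        x^gamma
    }
    linearise
}

f <- train_leaky(NULL)

# The enclosing environment of `f` is the whole execution environment of
# train_leaky(), so it lists big, data, gamma and linearise.
ls(environment(f))

# Serialising the function drags `big` along (about 8 bytes per double).
length(serialize(f, NULL))
```

Although `linearise` only needs `gamma`, saving or returning it keeps `big` 
alive as well.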

The first point can be dealt with by removing those components of the model 
which are not necessary for prediction (e.g. `model$residuals <- NULL`). For 
the second point, more work and care are needed to clean up all enclosing 
environments of created functions (not only `linearise` but also `model$terms` 
etc.).
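
The "manual" cleanup I mean looks roughly like the following sketch for an 
`lm` fit. The helper name `strip_model` and the choice of components to drop 
are assumptions for illustration; the `qr` component is left alone here as a 
conservative choice, and standard errors or intervals from `predict` would 
need more components than are kept.

```r
# Hypothetical helper: drop what point prediction on new data does not need.
strip_model <- function(model) {
    model$residuals <- NULL
    model$fitted.values <- NULL
    model$effects <- NULL
    model$model <- NULL      # the stored model frame (training data)
    # The terms object remembers the environment the formula was created in;
    # point it at the global environment instead of the training call frame.
    attr(model$terms, ".Environment") <- globalenv()
    model
}

train_stripped <- function(data) {
    scratch <- rnorm(1e5)    # stand-in for an intermediate computation
    fit <- lm(y ~ x, data = data)
    strip_model(fit)
}

d <- data.frame(x = 1:100, y = 2 * (1:100) + rnorm(100))
small <- train_stripped(d)

# Point predictions on new data still work after stripping.
p <- predict(small, newdata = data.frame(x = c(1, 2)))

# `scratch` is no longer reachable from the returned model.
length(serialize(small, NULL))
```

Every model class and every returned function needs its own variant of this, 
which is what makes the approach tedious.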

I have read that V8's garbage collector avoids this problem by distinguishing 
between local and context variables:

https://stackoverflow.com/questions/5326300/garbage-collection-with-node-js

Can something similar be done in R? Is there a programming technique that is 
less tedious than "manual" cleanup of all enclosing environments?
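
One candidate technique, as a sketch only: create each returned function 
through a small factory defined at top level, so that its enclosing 
environment holds nothing but the arguments of the factory. The names 
`make_linearise` and `train_clean` are hypothetical.

```r
# Factory defined at top level: its execution environment holds only `gamma`.
make_linearise <- function(gamma) {
    force(gamma)             # evaluate now, so no reference to the caller's
                             # environment is kept through a pending promise
    function(x) x^gamma
}

train_clean <- function(data) {
    scratch <- rnorm(1e5)    # stand-in for an intermediate computation
    gamma <- 0.5             # assumed value; really derived from `data`
    make_linearise(gamma)
}

g <- train_clean(NULL)

# Only `gamma` lives in the enclosing environment of `g`.
ls(environment(g))

# The serialised function no longer contains `scratch`.
length(serialize(g, NULL))
```

This covers functions I create myself, but not the environments captured by 
model objects such as `model$terms`.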

Thanks,
Christian 

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
