Hi all,
I have to run a logit regression on a large dataset and I am not sure
about the best way to do it. The dataset is about 200,000 x 2,000, and R
runs out of memory when creating it.
After going through the help archives and the mailing lists, I think there
are two main options, though I am not sure which one would be better.
Of course, any alternative is welcome as well. In fact, I am not even sure
whether either of these options will work, so before diving in I would
like to get some advice.
-A first option is to use the ff package, which allows working with the
dataset without loading it into RAM. This, combined with the bigglm
function, should do the job (a rough sketch of what I have in mind is
below).
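This is roughly what I am imagining, untested and just to show the idea;
it assumes the data sit in a file "big.csv" with a binary response y and
predictors x1, x2, ..., and that ffbase provides a bigglm method for
ffdf objects:

# untested sketch: chunked, on-disk data with ff + bigglm
library(ff)
library(ffbase)   # supplies bigglm() support for ffdf objects
library(biglm)

# read the file in chunks; the data live on disk, not in RAM
dat <- read.csv.ffdf(file = "big.csv", header = TRUE,
                     next.rows = 50000)

# fit the logit model chunk by chunk
# (the real formula would list all 2000 predictors)
fit <- bigglm(y ~ x1 + x2 + x3,
              data = dat,
              family = binomial(),
              chunksize = 10000)
summary(fit)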
-The dataset contains many sparse variables, so I was wondering
whether building the model matrix as a sparse matrix might give good
results. In this case, I am not sure whether glm or some extension of it
can handle sparse matrices (I could not find any documentation on this).
If it is possible, this second option seems more efficient, since R might
be able to exploit the sparsity to speed up the computations (a sketch
of what I am imagining is below).
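Something along these lines, again untested; it assumes the raw data
frame "df" fits in memory and only the dense model matrix is the problem,
and it uses glmnet, which I believe accepts sparse "dgCMatrix" input for
logistic models (though it fits a penalized model by default):

# untested sketch: sparse design matrix + glmnet logistic fit
library(Matrix)
library(glmnet)

# build the design matrix in sparse form instead of a dense matrix
X <- sparse.model.matrix(y ~ . - 1, data = df)

# regularized logistic regression on the sparse matrix
fit <- glmnet(X, df$y, family = "binomial")
# (setting lambda near 0 should approximate an unpenalized fit, but I am
#  not sure how stable that is with 2000 columns)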
Thanks in advance.
All the best!
Julio.