See also bigglm() in package biglm.

On Sat, 9 Aug 2008, Pradheep K E wrote:

> Hi R-experts,
>
> Does anyone have experience using R for handling large scale data (millions
> of rows, hundreds or thousands of features)?
>
> What is the largest size of data that anyone has used with glm?

I've used 700,000 rows and about 100 columns, but that was four years ago and we have more memory now. It matters whether the 'features' are numeric or categorical, as the latter can expand to many columns in the model matrix.

As a rough guide, expect glm() to need about 200 bytes of memory for each cell of the data, i.e. 200 x nrows x ncols bytes in total. Calling glm.fit directly will be more efficient (I've just tested 100,000 x 100, which used 1.2 GB).
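To make the rule of thumb concrete, here is a small arithmetic sketch. The helper estimate_glm_bytes() is hypothetical (it is not part of R), and the factor of 200 is only the rough guide quoted above, not a measured constant.

```r
# Rough guide from the text: about 200 bytes per cell of an nrows x ncols data set.
# estimate_glm_bytes() is a hypothetical helper written for this illustration.
estimate_glm_bytes <- function(nrows, ncols, bytes_per_cell = 200) {
  nrows * ncols * bytes_per_cell
}

# The 100,000 x 100 case mentioned above
small <- estimate_glm_bytes(1e5, 100)
small / 1e9   # estimate in units of 10^9 bytes

# The 700,000 x 100 case mentioned above
large <- estimate_glm_bytes(7e5, 100)
large / 1e9   # estimate in units of 10^9 bytes
```

For the 100,000 x 100 case the guide gives 2 x 10^9 bytes, so the 1.2 GB observed with glm.fit corresponds to roughly 120 bytes per cell.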

> Also, is there a library to read data in sparse data format (like SVMlight
> format)?

You mean *store* data in a sparse format when read in? I'm not sure of the relevance, but look at the method of bigglm() that takes a function as its data argument for a way to avoid even doing that. If the data are numeric there are at least three sparse-matrix packages on CRAN.
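As an illustration of holding numeric data in sparse form, the following is a minimal sketch that parses a few SVMlight-style lines into a sparse matrix. It assumes the Matrix package (one possible choice; the text above does not name a package), and parse_svmlight() is a hypothetical helper. It handles only plain "label index:value ..." lines, without comments or qid fields. Note that glm() itself works with a dense model matrix, so this addresses storage only.

```r
library(Matrix)

# Hypothetical helper: parse "label idx:val idx:val ..." lines into a
# response vector and a sparse feature matrix (1-based feature indices).
parse_svmlight <- function(lines) {
  parts <- strsplit(trimws(lines), "[[:space:]]+")
  y <- as.numeric(vapply(parts, `[`, character(1), 1))
  feats <- lapply(parts, function(p) p[-1])
  # Row index of each index:value pair
  i <- rep(seq_along(feats), lengths(feats))
  kv <- strsplit(unlist(feats), ":", fixed = TRUE)
  j <- as.integer(vapply(kv, `[`, character(1), 1))
  x <- as.numeric(vapply(kv, `[`, character(1), 2))
  list(y = y,
       x = sparseMatrix(i = i, j = j, x = x,
                        dims = c(length(lines), max(j))))
}

# Made-up example lines, for illustration only
ex <- c("1 1:0.5 3:2", "-1 2:1.5", "1 1:1 4:3")
d <- parse_svmlight(ex)
d$y
d$x
```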

Ultimately R's code such as glm() is designed for flexibility and to do interesting things with the fit: for really large problems you will do better to write a specialized fitting routine. bigglm() is an intermediate position.
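A minimal sketch of the function method of bigglm() follows, assuming the biglm package is installed. The data argument is a function with a reset argument that returns the next chunk as a data frame, or NULL when the data are exhausted. Here make_chunk_reader() is a hypothetical helper and the data are simulated in memory purely for illustration; in practice the function would read each chunk from a file or database so that the full data set is never held in memory.

```r
library(biglm)

# Simulated data standing in for a large external source (illustration only)
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(0.5 * dat$x1 - 0.25 * dat$x2))

# Hypothetical helper: returns a function(reset) that hands out successive
# chunks of rows and returns NULL once all rows have been served.
make_chunk_reader <- function(df, chunk_rows = 1000) {
  pos <- 0
  function(reset = FALSE) {
    if (reset) {
      pos <<- 0          # start again from the first row
      return(NULL)
    }
    if (pos >= nrow(df)) return(NULL)
    idx <- (pos + 1):min(pos + chunk_rows, nrow(df))
    pos <<- max(idx)
    df[idx, , drop = FALSE]
  }
}

# bigglm() makes one pass over the chunks per iteration of the fit
fit <- bigglm(y ~ x1 + x2, data = make_chunk_reader(dat), family = binomial())
coef(fit)
```

The estimates should be close to those from glm() on the same data, since only the way the data are fed to the fit differs.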

There's also the question of whether there are any interesting homogeneous datasets of this sort of size. Often doing analyses on subsets and a meta-analysis is a much more insightful approach (as it was in our problem: we split on one of the categorical explanatory variables).
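The subset approach can be sketched as follows: split on a categorical variable, fit the same model to each piece, and compare or combine the estimates. The data are simulated, the variable names are made up, and the inverse-variance (fixed-effect) pooling at the end is just one simple way to combine subset estimates; it is an assumption of this sketch, not necessarily what was done in the analysis described above.

```r
# Simulated data with a categorical splitting variable g (illustration only)
set.seed(42)
n <- 3000
dat <- data.frame(g = factor(sample(c("a", "b", "c"), n, replace = TRUE)),
                  x = rnorm(n))
dat$y <- rbinom(n, 1, plogis(-0.5 + 0.8 * dat$x))

# Fit the same model separately within each level of g
fits <- lapply(split(dat, dat$g), function(d) {
  glm(y ~ x, family = binomial(), data = d)
})

# Collect the slope estimate and its standard error from each subset
est <- sapply(fits, function(f) {
  coef(summary(f))["x", c("Estimate", "Std. Error")]
})
est

# Fixed-effect pooling with inverse-variance weights (assumed method)
w <- 1 / est["Std. Error", ]^2
pooled <- sum(w * est["Estimate", ]) / sum(w)
pooled_se <- sqrt(1 / sum(w))
c(estimate = pooled, se = pooled_se)
```

Looking at the per-subset estimates before pooling is where the insight usually comes from: large differences between subsets suggest the data are not homogeneous.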

> Thanks
> Pradheep


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


--
Brian D. Ripley,                  [EMAIL PROTECTED]
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
