On 09/04/2010 7:38 PM, Matthew Keller wrote:
> Hi all,
>
> My institute will hopefully be working on cutting-edge genetic
> sequencing data by the Fall of 2010. The datasets will be tens of GB
> in size and growing. I'd like to use R to do the primary analyses.
> This is OK, because we can just throw $ at the problem and get lots of
> RAM running on 64-bit R. However, we are still running up against the
> fact that vectors in R cannot contain more than 2^31-1 elements. I
> know there are "ways around" this issue, and trust me, I think I've
> tried them all (e.g., bringing in portions of the data at a time;
> using large-dataset packages in R; using SQL databases, etc.). But all
> these 'solutions' are, at the end of the day, much, much more
> cumbersome, programming-wise, than just doing things in native R.
> Maybe that's just the cost of doing what I'm doing. But my questions,
> which may well be naive (I'm not a computer programmer), are:
>
> 1) Is the limit of 2^31-1 elements per vector *inherent*? I.e., in an
> alternative history of R's development, would it have been feasible
> for R not to have had this limitation?
The problem is that we use "int" as a vector index. On most platforms,
that's a signed 32-bit integer, with a maximum value of 2^31-1.
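A quick illustration in an R session (the exact error wording varies
across versions and platforms, but the failure is the same):

    .Machine$integer.max
    ## [1] 2147483647          i.e. 2^31 - 1, the largest valid index
    x <- numeric(2^31)         # one element more than the limit
    ## error: the requested length exceeds what R can address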
> 2) Is there any possibility that this limit will be overcome in future
> revisions of R?
Of course, R is open source. You could rewrite all of the internal code
tomorrow to use 64-bit indexing.

Will someone else do it for you? Even that is possible. One problem is
that it would make all of your data incompatible with older versions of
R. And back to the original question: are you willing to pay for the
development? Then go ahead, you can have it tomorrow (or later, if your
budget is limited). Are you waiting for someone else to do it for free?
Then you need to wait until someone who knows how to do it wants to do
it.
> I'm very, very grateful to the people who have spent important parts
> of their professional lives developing R. I don't think anyone back
> in, say, 1995, could have foreseen that datasets would be >> 2^31-1
> elements in size. For better or worse, however, in many fields of
> science, that is routinely the case today. *If* it's possible to get
> around this limit, then I'd like to know whether the R Development
> Team takes seriously the needs of large-data users, or whether they
> feel (perhaps not mutually exclusively) that developing such capacity
> is best left to ad hoc R packages and alternative analysis programs.
There are many ways around the limit today. Put your data in a data
frame with many columns, each of length 2^31-1 or less. Put your data
in a database, and process it a block at a time. Etc.
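For instance, here is a rough sketch of the block-at-a-time idea, using
a hypothetical one-value-per-line text file and a placeholder block
size; the point is that summaries can be accumulated without ever
holding the full vector in memory:

    con <- file("big_data.txt", open = "r")  # hypothetical input file
    block_size <- 1e6                        # values to read per block
    total <- 0
    n <- 0
    repeat {
      lines <- readLines(con, n = block_size)
      if (length(lines) == 0) break          # end of file
      x <- as.numeric(lines)                 # one value per line assumed
      total <- total + sum(x)                # running sum
      n <- n + length(x)
    }
    close(con)
    total / n                                # mean of the whole dataset

The same pattern works with a database interface: issue one query, then
fetch() a block of rows at a time and update the summaries as you go.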
Duncan Murdoch
> Best,
> Matt
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.