I have some code that can potentially produce a huge number of
large-ish R data frames, each with a different number of rows. All the
data frames together will be way too big to keep in R's memory, but
we'll assume a single one is manageable. It's just when there are a
million of them that the machine might start to burn up.
However, I might, for example, want to compute some averages over the
elements in the data frames, or I might want to sample ten of them at
random and do some plots. What I need is rapid random access to data
stored in external files.
Here are some ideas I've had:
* Store all the data in an HDF5 file - the problem here is that the
current HDF package for R reads the whole file in at once.
* Store the data in some other custom binary format with an index for
rapid access to the N-th element. Problems: feels like reinventing HDF,
cross-platform issues, etc.
* Store the data in a number of .RData files in a directory. Hence to
get the N-th element I just attach(paste("foo/A-", n, ".RData", sep="")),
give or take a parameter or two - see the first sketch after this list.
* Use a database. Seems a bit heavyweight, but maybe RSQLite could work
and would keep everything local - second sketch below.
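
To make the .RData-per-file idea concrete, something like this is what I
had in mind - a minimal, untested sketch, with the "foo" directory and the
"A-<n>.RData" naming purely made up:

put.frame <- function(df, n, dir = "foo") {
    ## save each frame under a fixed variable name so we know
    ## what to pull back out later
    A <- df
    save(A, file = file.path(dir, paste("A-", n, ".RData", sep = "")))
}

get.frame <- function(n, dir = "foo") {
    ## load into a throwaway environment rather than attach()ing,
    ## so nothing leaks onto the search path
    e <- new.env()
    load(file.path(dir, paste("A-", n, ".RData", sep = "")), envir = e)
    get("A", envir = e)
}

Sampling ten frames at random would then just be
lapply(sample(n.frames, 10), get.frame), assuming I keep a count
n.frames somewhere.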
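Similarly, for the RSQLite idea I imagine it boils down to something like
the following (again untested, and it assumes every frame has the same
columns so they can all live in one table with an id column):

library(RSQLite)

con <- dbConnect(SQLite(), dbname = "frames.db")

put.frame.db <- function(con, df, n) {
    ## tag each row with the frame it belongs to, then append
    df$frame_id <- n
    dbWriteTable(con, "frames", df, append = TRUE, row.names = FALSE)
}

get.frame.db <- function(con, n) {
    dbGetQuery(con, paste("SELECT * FROM frames WHERE frame_id =", n))
}

A nice side-effect is that averages over all the frames could then be
done in SQL without ever pulling a whole frame into R.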
What I'm currently doing is keeping the code OO enough that I can, in
theory, implement any of the above behind the same interface. At the
moment I have an implementation that keeps everything in R's memory as a
list of data frames, which is fine for small test cases, but things are
going to get big shortly. Any other ideas or hints are welcome.
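
For what it's worth, the shape of the interface I have in mind is roughly
this (a made-up sketch, not my actual code) - an in-memory backend built
from closures, with the file- or database-backed versions just providing
the same functions:

memory.store <- function() {
    frames <- list()
    list(
        put  = function(df, n) frames[[n]] <<- df,
        get  = function(n) frames[[n]],
        size = function() length(frames)
    )
}

## e.g.
s <- memory.store()
s$put(data.frame(x = rnorm(5)), 1)
mean(s$get(1)$x)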
thanks
Barry