On Dec 14, 2007 1:01 PM, Barry Rowlingson <[EMAIL PROTECTED]> wrote:
> I have some code that can potentially produce a huge number of
> large-ish R data frames, each of a different number of rows. All the
> data frames together will be way too big to keep in R's memory, but
> we'll assume a single one is manageable. It's just when there's a
> million of them that the machine might start to burn up.
>
> However I might, for example, want to compute some averages over the
> elements in the data frames. Or I might want to sample ten of them at
> random and do some plots. What I need is rapid random access to data
> stored in external files.
>
> Here's some ideas I've had:
>
> * Store all the data in an HDF-5 file - problem here is that the
> current HDF package for R reads the whole file in at once.
>
> * Store the data in some other custom binary format with an index for
> rapid access to the N-th elements. Problems: feels like reinventing HDF,
> cross-platform issues, etc.
>
> * Store the data in a number of .RData files in a directory. Hence to
> get the N-th element just attach(paste("foo/A-",n,'.RData')) give or
> take a parameter or two.
>
> * Use a database. Seems a bit heavyweight, but maybe using RSQLite
> could work in order to keep it local.
>
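For the .RData-files-in-a-directory idea, here is a rough, untested sketch of what random access by index could look like; the foo/A-<n>.RData layout follows the example above, and put_df()/get_df() are just illustrative helper names:

put_df <- function(df, n, dir = "foo") {
    ## one file per data frame, named by its index
    save(df, file = file.path(dir, sprintf("A-%d.RData", n)))
}

get_df <- function(n, dir = "foo") {
    ## load into a private environment rather than attach()ing,
    ## so nothing is added to the search path
    e <- new.env()
    load(file.path(dir, sprintf("A-%d.RData", n)), envir = e)
    e$df    # the object was saved under the name 'df'
}

## e.g. average a (hypothetical) column x over ten frames picked at random:
## idx <- sample(1000000, 10)
## mean(sapply(idx, function(i) mean(get_df(i)$x)))

Loading into a fresh environment keeps the read-back free of side effects, which matters once you start pulling in thousands of files in a loop.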
Unless you really need this to be a general solution, I would suggest using a database. If you use one that allows you to create functions within it, you can even keep some of the calculations on the server side (which may be a performance advantage). If you are doing a lot of this, you might consider Postgres and PL/R, which embeds R in the database.

Sean
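For the RSQLite option mentioned in the original message, a minimal, untested sketch could look like the following. It assumes all the frames share the same columns so they can live in one table; the table name "frames", the index column "frame_id", and column "x" are made up for illustration:

library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "frames.db")

put_df <- function(df, n) {
    ## tag every row with the index of the frame it came from
    df$frame_id <- n
    if (dbExistsTable(con, "frames")) {
        dbWriteTable(con, "frames", df, append = TRUE)
    } else {
        dbWriteTable(con, "frames", df)
    }
}

get_df <- function(n) {
    d <- dbGetQuery(con,
        sprintf("SELECT * FROM frames WHERE frame_id = %d", as.integer(n)))
    d$frame_id <- NULL
    d
}

## aggregates can then be pushed into SQL instead of looping in R, e.g.
## dbGetQuery(con, "SELECT frame_id, AVG(x) FROM frames GROUP BY frame_id")

If the frames do not share a common set of columns, you would need one table per frame, or a column holding serialize()d frames, instead of the single-table layout above.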