This is exactly the kind of job at which sort/merge applications on the mainframe have excelled for the last 40 years. If you cannot send the data to a mainframe, look at the SyncSort package, which runs on UNIX machines.
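If neither is an option, the sort/merge idea itself can be sketched as an external merge sort: read the file in chunks that fit in memory, sort each chunk, write it out as a sorted run, then stream a k-way merge over the runs. The Python below is only a minimal sketch, not how SyncSort or a mainframe sort is implemented. The names `external_sort`, `sort_key`, and `chunk_lines` are made up for illustration, and it assumes an uncompressed, whitespace-delimited text file whose first three columns can be compared as strings.

```python
import heapq
import itertools
import os
import tempfile


def sort_key(line):
    # Assumption: whitespace-delimited fields, columns 1-3 compared as strings.
    # Swap in int()/float() conversions here if the columns are numeric.
    fields = line.split()
    return (fields[0], fields[1], fields[2])


def external_sort(in_path, out_path, chunk_lines=1_000_000, key=sort_key):
    """Sort a large text file via sorted runs on disk plus a k-way merge."""
    run_paths = []
    try:
        # Phase 1: cut the input into chunks that fit in memory,
        # sort each chunk, and write it to a temporary run file.
        with open(in_path) as src:
            while True:
                chunk = list(itertools.islice(src, chunk_lines))
                if not chunk:
                    break
                # Guard against a final line with no trailing newline.
                chunk = [ln if ln.endswith("\n") else ln + "\n" for ln in chunk]
                chunk.sort(key=key)
                fd, path = tempfile.mkstemp(suffix=".run")
                with os.fdopen(fd, "w") as run:
                    run.writelines(chunk)
                run_paths.append(path)

        # Phase 2: merge the sorted runs lazily; only one line per run
        # is held in memory at any time.
        runs = [open(p) for p in run_paths]
        try:
            with open(out_path, "w") as dst:
                dst.writelines(heapq.merge(*runs, key=key))
        finally:
            for f in runs:
                f.close()
    finally:
        for p in run_paths:
            os.remove(p)
```

Memory use is governed by `chunk_lines`, so it can be tuned to stay under a fixed RAM budget. For gzipped input, `gzip.open(in_path, "rt")` could stand in for `open(in_path)`. With a very large number of runs, the merge may need to be done in several passes to stay under the open-file limit.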

On Mon, Jul 30, 2012 at 12:25 PM, Matthew Keller <mckellerc...@gmail.com> wrote:
> Hello all,
>
> I have some genetic datasets (gzipped) that contain 6 columns and
> upwards of 10s of billions of rows. The largest dataset is about 16 GB
> on file, gzipped (!). I need to sort them according to columns 1, 2,
> and 3. The setkey() function in the data.table package does this
> quickly, but of course we're limited by R not being able to index
> vectors with > 2^31 elements, and bringing in only the parts of the
> dataset we need is not applicable here.
>
> I'm asking for practical advice from people who've done this or who
> have ideas. We'd like to be able to sort the biggest datasets in hours
> rather than days (or weeks!). We cannot have any process take over 50
> GB RAM max (we'd prefer smaller so we can parallelize).
>
> Relational databases seem too slow, but maybe I am wrong. A quick look
> at the bigmemory package doesn't turn up an ability to sort like this,
> but again, maybe I'm wrong. My computer programmer writes in C++, so
> if you have ideas in C++, that works too.
>
> Any help would be much appreciated... Thanks!
>
> Matt
>
>
> --
> Matthew C Keller
> Asst. Professor of Psychology
> University of Colorado at Boulder
> www.matthewckeller.com
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.