Hi,

On Fri, Oct 5, 2012 at 1:41 PM, Ista Zahn <istaz...@gmail.com> wrote:
> On Fri, Oct 5, 2012 at 12:09 PM, PIKAL Petr <petr.pi...@precheza.cz> wrote:
[snip]
>> If I compute correctly, such a big matrix (20e6*1000) needs about 160 GB
>> just to be in memory. Are you prepared for this?
>
> This is not as outrageous as one might think -- you can get a mac pro
> with 32 gigs of memory for around $3,500
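For reference, the quoted 160 GB figure follows if each entry is stored as an 8-byte double. Below is a rough sketch of that arithmetic in Python. The sparse estimate is only illustrative: the 1% density, the 4-byte indices, and the generic compressed-sparse-column layout are my assumptions, not anything known about the OP's data or a particular package.

```python
def dense_bytes(n_rows, n_cols, bytes_per_value=8):
    """Memory for a dense matrix, assuming 8-byte doubles by default."""
    return n_rows * n_cols * bytes_per_value


def sparse_csc_bytes(n_rows, n_cols, density,
                     bytes_per_value=8, bytes_per_index=4):
    """Rough memory for a compressed-sparse-column layout.

    Assumes one value and one row index per nonzero entry, plus
    (n_cols + 1) column pointers. `density` is the assumed fraction
    of nonzero entries (hypothetical here).
    """
    nnz = round(n_rows * n_cols * density)
    return (nnz * (bytes_per_value + bytes_per_index)
            + (n_cols + 1) * bytes_per_index)


n_rows, n_cols = 20_000_000, 1000  # 20e6 * 1000, as in the quoted message

dense_gb = dense_bytes(n_rows, n_cols) / 1e9
sparse_gb = sparse_csc_bytes(n_rows, n_cols, density=0.01) / 1e9  # assumed 1% nonzero

print(f"dense:  {dense_gb:.1f} GB")
print(f"sparse: {sparse_gb:.1f} GB (at an assumed 1% density)")
```

The gap between the two numbers depends entirely on the real density, so it is worth measuring that on a sample before counting on it.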
And even so, I suspect the matrices that will be worked with are
sparse, so you can get more savings there (although I'm not sure which
of the packages the OP had listed work with sparse input).

That having been said, if you don't want to sample from your data,
sometimes R isn't the best solution. There are projects being
developed specifically to deal with such big data.

For one, you might consider looking at the graphlab/graphchi stuff:

http://graphlab.org

(Graphchi is meant to process big data on a "modest" machine.)

If you go to the "Toolkits" menu, you'll see they have an
implementation of kmeans++ clustering that might be suitable for your
clustering analysis (perhaps some matrix factorizations are useful
here, too -- perhaps your "market basket" data can be viewed as some
type of collaborative filtering problem, in which case their
collaborative filtering toolkit is right up your alley ;-)

The OP also mentioned classification trees. Perhaps rf-ace might be useful:

http://code.google.com/p/rf-ace/

From their website:

"""
RF-ACE implements both Random Forest (RF) and Gradient Boosting Tree
(GBT) algorithms, and is strongly related to ACE, originally outlined
in http://jmlr.csail.mit.edu/papers/volume10/tuv09a/tuv09a.pdf
"""

If you scroll down to the "case study" section of their main page, you
can see there is some talk about how they used this in a distributed
manner ... perhaps it is applicable in your case as well (in which
case you might be able to rig up AWS to help you).

HTH,
-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.