> Hi,
>
> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
> <ro...@bestroman.com> wrote:
> > Hi R list,
> >
> > I'm new to R software, so I'd like to ask about its capabilities.
> > What I'm looking to do is run some statistical tests on quite big
> > tables, which are aggregated quotes from a market feed.
> >
> > This is a typical set of data.
> > Each day contains millions of records (up to 10 million unfiltered).
> >
> > 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> > 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> > 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
> >
> > I'll need to filter it first based on some criteria.
> > Since I keep it in a MySQL database, this can be done with a query.
> > It's not super efficient; I've checked already.
> >
> > Then I need to aggregate the dataset into different time frames
> > (time is represented in ms from midnight, like 35482391).
> > Again, this can be done with a database query; I'm not sure which
> > is going to be faster.
> > The aggregated tables are going to be much smaller, on the order of
> > thousands of rows per observation day.
> >
> > Then calculate basic statistics: mean, standard deviation, sums etc.
> > After the stats are calculated, I need to perform some statistical
> > hypothesis tests.
> >
> > So, my question is: which tool is faster for data aggregation and
> > filtering on big datasets: MySQL or R?
>
> Why not try a few experiments and see for yourself -- I guess the
> answer will depend on what exactly you are doing.
>
> If your datasets are *really* huge, check out some packages listed
> under the "Large memory and out-of-memory data" section of the
> "HighPerformanceComputing" task view at CRAN:
>
> http://cran.r-project.org/web/views/HighPerformanceComputing.html
>
> Also, if you find yourself needing to do lots of
> "grouping/summarizing" type of calculations over large data
> frame-like objects, you might want to check out the data.table
> package:
>
> http://cran.r-project.org/web/packages/data.table/index.html
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact

I don't think data.table is fundamentally different from the
data.frame type when it comes to memory, but thanks for the
suggestion. From the intro vignette:

http://cran.r-project.org/web/packages/data.table/vignettes/datatable-intro.pdf

"Just like data.frames, data.tables must fit inside RAM"

The ff package by Adler, listed under "Large memory and out-of-memory
data", is probably the most interesting.

--Roman Naumenko
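
For reference, the filter / aggregate / summarize steps described in the original question can be sketched as below. This is only a minimal, tool-neutral illustration in Python (standard library only), not a benchmark and not a statement about how either MySQL or R would do it. The column meanings are assumptions: the original post only says that the time field is in ms from midnight (35482391 in the sample), so treating field 6 as size and field 7 as price is a guess, and the one-minute time frame and the "Bid" filter are hypothetical choices.

```python
from collections import defaultdict
from statistics import mean, pstdev

# The three sample rows from the original question, whitespace-separated.
ROWS = """\
2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
"""

BUCKET_MS = 60_000  # hypothetical time frame: one minute


def aggregate(text, bucket_ms=BUCKET_MS, side="Bid"):
    """Filter rows by side, bucket them by time, and summarize each bucket."""
    buckets = defaultdict(list)
    for line in text.splitlines():
        fields = line.split()
        # Filter step: skip malformed rows and rows for the other side.
        if len(fields) < 9 or fields[2] != side:
            continue
        size = int(fields[5])     # assumed: quote size
        price = float(fields[6])  # assumed: quote price
        ms = int(fields[8])       # ms from midnight, as stated in the post
        # Aggregation step: integer division assigns the row to a time frame.
        buckets[ms // bucket_ms].append((price, size))

    summary = {}
    for bucket, quotes in sorted(buckets.items()):
        prices = [p for p, _ in quotes]
        summary[bucket] = {
            "n": len(quotes),
            "mean_price": mean(prices),
            "sd_price": pstdev(prices),  # population SD; a choice, not a given
            "total_size": sum(s for _, s in quotes),
        }
    return summary


stats = aggregate(ROWS)
for bucket, row in stats.items():
    print(bucket, row)
```

All three sample rows share the same timestamp, so they fall into a single one-minute bucket. The same shape (filter, integer-divide the timestamp, group, summarize) maps onto a SQL GROUP BY or a grouped summary in R, which makes it a convenient yardstick for the "try a few experiments" advice above.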