[R] Processing large datasets

Roman Naumenko Tue, 24 May 2011 23:54:59 -0700

Hi R list,

I'm new to R, so I'd like to ask about its capabilities.
What I'm looking to do is run some statistical tests on fairly large 
tables of aggregated quotes from a market feed.


This is a typical set of data.
Each day contains millions of records (up to 10 million unfiltered).

2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y  1  1  0  0
2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y  1  0  0  0
2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y  1  0  0  0

I'll need to filter it first based on some criteria.
Since I keep it in a MySQL database, this can be done with a query. 
I've already checked, and it is not very efficient.

Then I need to aggregate the dataset into different time frames (time is 
represented in ms from midnight, like 35482391).
Again, this can be done through a database query; I'm not sure which will 
be faster.
The aggregated tables are going to be much smaller, on the order of 
thousands of rows per observation day.
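
To make the aggregation step concrete, here is a minimal sketch in base R. The column names (symbol, size, price, time_ms), the guess about which fields are size and price, and the one-minute bucket width are all assumptions for illustration; only the idea of bucketing by milliseconds from midnight comes from the description above.

```r
# Hypothetical column names and values, loosely modelled on the sample rows.
quotes <- data.frame(
  symbol  = c("DELL", "DELL", "DELL", "DELL"),
  size    = c(400, 300, 135, 200),
  price   = c(15.48, 15.48, 15.48, 15.50),
  time_ms = c(35482391, 35482391, 35482391, 35545000)
)

# Assumed time frame: one minute, expressed in milliseconds.
bucket_ms <- 60 * 1000

# Integer division maps each timestamp to the index of its time frame.
quotes$bucket <- quotes$time_ms %/% bucket_ms

# Aggregate per symbol and time frame: total size and mean price.
agg_size  <- aggregate(size ~ symbol + bucket, data = quotes, FUN = sum)
agg_price <- aggregate(price ~ symbol + bucket, data = quotes, FUN = mean)
agg <- merge(agg_size, agg_price, by = c("symbol", "bucket"))

print(agg)
```

Changing bucket_ms gives other time frames. For millions of rows per day, base aggregate() may be slow, so this only shows the shape of the computation.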

Then calculate basic statistics: mean, standard deviation, sums, etc.
After the stats are calculated, I need to perform some statistical 
hypothesis tests.
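
As a rough illustration of this last step, the sketch below computes the summary statistics and runs one possible test in base R. The two vectors are made-up values, and the choice of a two-sample t-test is an assumption, since the actual tests are not specified here.

```r
# Hypothetical aggregated values for two observation days (made-up numbers).
day1 <- c(15.48, 15.50, 15.47, 15.52, 15.49)
day2 <- c(15.55, 15.53, 15.58, 15.54, 15.56)

# Basic statistics: mean, standard deviation, and sum.
summary_stats <- function(x) c(mean = mean(x), sd = sd(x), sum = sum(x))
stats1 <- summary_stats(day1)
stats2 <- summary_stats(day2)

# Welch two-sample t-test (the default of t.test) on the two days;
# the test itself is an assumed example, not a recommendation.
result <- t.test(day1, day2)

print(rbind(day1 = stats1, day2 = stats2))
print(result$p.value)
```

All of this runs on the small aggregated tables, so it should be cheap whichever tool does the filtering and aggregation.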

So, my question is: which tool is faster for data aggregation and filtering 
on big datasets: MySQL or R?

Thanks,
--Roman N.


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
