Take a look at the High-Performance and Parallel Computing with R CRAN Task View:
http://cran.us.r-project.org/web/views/HighPerformanceComputing.html
specifically the section labeled "Large memory and out-of-memory data".
Some specific R packages have been implemented to enable out-of-memory
operations, though not everything can be done that way. I believe that
Revolution's commercial version of R has developed 'big data'
functionality, but I would defer to them for additional details.

You can of course use a 64-bit version of R on a 64-bit OS to increase
accessible RAM; however, there will still be object size limitations,
because R uses 32-bit signed integers for indexing into objects. See
?"Memory-limits" for more information. (Minimal sketches of the
out-of-memory and database-backed approaches follow the quoted thread
below.)

HTH,

Marc Schwartz

On May 25, 2011, at 8:49 AM, Roman Naumenko wrote:

> Thanks Jonathan.
>
> I'm already using RMySQL to load data for a couple of days.
> I wanted to know what the relevant R capabilities are if I want to
> process much bigger tables.
>
> R always reads the whole set into memory, and this might be a
> limitation in the case of big tables, correct?
> Doesn't it use temporary files or something similar to deal with such
> amounts of data?
>
> As an example, I know that SAS handles sas7bdat files up to 1TB on a
> box with 76GB memory, without noticeable issues.
>
> --Roman
>
> ----- Original Message -----
>
>> In cases where I have to parse through large datasets that will not
>> fit into R's memory, I grab the relevant data using SQL and then
>> analyze that data using R. There are several packages designed to do
>> this, like [1] and [2] below, that allow you to query a database
>> using SQL and end up with that data in an R data.frame.
>>
>> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
>> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
>>
>> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko
>> <ro...@bestroman.com> wrote:
>>> Hi R list,
>>>
>>> I'm new to R software, so I'd like to ask about its capabilities.
>>> What I'm looking to do is to run some statistical tests on quite
>>> big tables which are aggregated quotes from a market feed.
>>>
>>> This is a typical set of data.
>>> Each day contains millions of records (up to 10 non filtered).
>>>
>>> 2011-05-24  750  Bid  DELL  14130770  400  15.4800  BATS  35482391  Y 1 1 0 0
>>> 2011-05-24  904  Bid  DELL  14130772  300  15.4800  BATS  35482391  Y 1 0 0 0
>>> 2011-05-24  904  Bid  DELL  14130773  135  15.4800  BATS  35482391  Y 1 0 0 0
>>>
>>> I'll need to filter it first based on some criteria.
>>> Since I keep it in a MySQL database, that can be done with a query;
>>> not super efficient, I have checked that already.
>>>
>>> Then I need to aggregate the dataset into different time frames
>>> (time is represented in ms from midnight, like 35482391).
>>> Again, this can be done with a database query; I'm not sure which
>>> will be faster.
>>> The aggregated tables are going to be much smaller, like thousands
>>> of rows per observation day.
>>>
>>> Then I calculate basic statistics: mean, standard deviation, sums,
>>> etc. After the stats are calculated, I need to perform some
>>> statistical hypothesis tests.
>>>
>>> So, my question is: which tool is faster for data aggregation and
>>> filtering on big datasets, MySQL or R?
>>>
>>> Thanks,
>>> --Roman N.
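For the "Large memory and out-of-memory data" packages pointed to above, the
Task View lists file-backed approaches such as bigmemory and ff. The following
is a minimal sketch, not a recommendation from the thread, using bigmemory and
assuming one day of quotes has been exported to a numeric-only CSV (a
big.matrix holds a single numeric type, so character columns such as the side,
ticker and exchange would need to be recoded or dropped first). The file name,
backing file names and column positions are hypothetical.

## Minimal sketch with the bigmemory package; names below are hypothetical.
library(bigmemory)

quotes <- read.big.matrix(
  "quotes_2011-05-24.csv",        # hypothetical numeric-only export
  header         = TRUE,
  type           = "double",
  backingfile    = "quotes.bin",  # the data live in this file on disk
  descriptorfile = "quotes.desc"  # reattach later via attach.big.matrix()
)

## Single columns can be pulled into ordinary R vectors on demand; a few
## million doubles is only a few tens of MB.
size  <- quotes[, 5]              # hypothetical: column 5 = quote size
price <- quotes[, 6]              # hypothetical: column 6 = price
mean(price)
sd(price)
sum(size)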
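On the 64-bit point, the indexing limit mentioned above can be inspected from
within R itself; this is a generic check, not specific to the quote data.

## The 32-bit signed integer indexing limit described in ?"Memory-limits":
## no single object can have more elements than this.
.Machine$integer.max        # 2147483647

## Rough per-object sizing: one million doubles is about 8 MB, so millions of
## rows times a handful of numeric columns quickly reaches the GB range.
object.size(numeric(1e6))

?"Memory-limits"            # the help page referred to above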
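The database-backed route Jonathan describes (and Roman already uses via
RMySQL) can do the filtering and aggregation server-side so that only the
small aggregated table reaches R. A minimal sketch; the connection details,
table name and column names are all hypothetical.

## Filter/aggregate in MySQL and pull only the result into R as a data.frame.
library(DBI)
library(RMySQL)

con <- dbConnect(MySQL(), dbname = "marketdata", host = "localhost",
                 user = "roman", password = "...")

## One row per symbol per one-minute bucket (ms from midnight / 60000);
## the result is an ordinary data.frame of a few thousand rows per day.
agg <- dbGetQuery(con, "
  SELECT symbol,
         FLOOR(ms_from_midnight / 60000) AS minute,
         AVG(price)          AS mean_price,
         STDDEV_SAMP(price)  AS sd_price,
         SUM(size)           AS total_size,
         COUNT(*)            AS n_quotes
  FROM quotes
  WHERE trade_date = '2011-05-24' AND side = 'Bid'
  GROUP BY symbol, minute")

dbDisconnect(con)
str(agg)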
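Once the aggregated table is an ordinary data.frame, the descriptive
statistics and hypothesis tests come straight from base R. Continuing from
the hypothetical agg table in the previous sketch, purely for illustration:

## Basic statistics on the (hypothetical) aggregated table, followed by one
## illustrative hypothesis test.
summary(agg$mean_price)
aggregate(total_size ~ symbol, data = agg, FUN = sum)   # per-symbol volume

## Illustrative Welch two-sample t-test: do minute-average DELL prices
## differ before and after noon (720 minutes from midnight)?
dell      <- subset(agg, symbol == "DELL")
morning   <- dell$mean_price[dell$minute <  720]
afternoon <- dell$mean_price[dell$minute >= 720]
t.test(morning, afternoon)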