With PostgreSQL at least, R can also be used as the implementation language for stored procedures, so data transfers between processes can be avoided altogether.
http://www.joeconway.com/plr/

Implementation of such a procedure in R appears to be straightforward:

CREATE OR REPLACE FUNCTION overpaid (emp) RETURNS bool AS '
    if (200000 < arg1$salary) {
        return(TRUE)
    }
    if (arg1$age < 30 && 100000 < arg1$salary) {
        return(TRUE)
    }
    return(FALSE)
' LANGUAGE 'plr';

CREATE TABLE emp (name text, age int, salary numeric(10,2));
INSERT INTO emp VALUES ('Joe', 41, 250000.00);
INSERT INTO emp VALUES ('Jim', 25, 120000.00);
INSERT INTO emp VALUES ('Jon', 35, 50000.00);

SELECT name, overpaid(emp) FROM emp;
 name | overpaid
------+----------
 Joe  | t
 Jim  | t
 Jon  | f
(3 rows)

Best

On Wednesday 25 May 2011 14:12:23 Jonathan Daily wrote:
> In cases where I have to parse through large datasets that will not
> fit into R's memory, I will grab the relevant data using SQL and then
> analyze that data using R. There are several packages designed to do
> this, like [1] and [2] below, that allow you to query a database using
> SQL and end up with that data in an R data.frame.
>
> [1] http://cran.cnr.berkeley.edu/web/packages/RMySQL/index.html
> [2] http://cran.cnr.berkeley.edu/web/packages/RSQLite/index.html
>
> On Wed, May 25, 2011 at 12:29 AM, Roman Naumenko <ro...@bestroman.com> wrote:
> > Hi R list,
> >
> > I'm new to the R software, so I'd like to ask about its capabilities.
> > What I'm looking to do is run some statistical tests on quite big
> > tables, which are aggregated quotes from a market feed.
> >
> > This is a typical set of data.
> > Each day contains millions of records (up to 10 million unfiltered).
> >
> > 2011-05-24 750 Bid DELL 14130770 400 15.4800 BATS 35482391 Y 1 1 0 0
> > 2011-05-24 904 Bid DELL 14130772 300 15.4800 BATS 35482391 Y 1 0 0 0
> > 2011-05-24 904 Bid DELL 14130773 135 15.4800 BATS 35482391 Y 1 0 0 0
> >
> > I'll need to filter it first based on some criteria.
> > Since I keep it in a MySQL database, this can be done with a query;
> > not super efficient, as I have already checked.
> >
> > Then I need to aggregate the dataset into different time frames (time
> > is represented in ms from midnight, like 35482391).
> > Again, this can be done with a database query, but I'm not sure which
> > is going to be faster. The aggregated tables are going to be much
> > smaller, like thousands of rows per observation day.
> >
> > Then I calculate basic statistics: mean, standard deviation, sums, etc.
> > After the stats are calculated, I need to perform some statistical
> > hypothesis tests.
> >
> > So, my question is: which tool is faster for data aggregation and
> > filtering on big datasets, MySQL or R?
> >
> > Thanks,
> > --Roman N.
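
For the workflow Roman describes above (filter in SQL, pull the reduced
result into R, aggregate into time buckets, then compute basic statistics),
a minimal sketch using the RMySQL package Jonathan mentions might look like
the following. The database, table, and column names here (marketdata,
quotes, symbol, ts_ms, price) are hypothetical placeholders, not anything
from the original messages.

library(DBI)
library(RMySQL)

## Connect to the (hypothetical) MySQL database holding the quote data
con <- dbConnect(MySQL(), dbname = "marketdata",
                 user = "user", password = "pass", host = "localhost")

## Do the heavy filtering in SQL, so only the reduced result set
## crosses the process boundary into an R data.frame
quotes <- dbGetQuery(con,
    "SELECT symbol, ts_ms, price FROM quotes WHERE symbol = 'DELL'")

## Aggregate into one-minute buckets (ts_ms is milliseconds from
## midnight, so one minute is 60000 ms)
quotes$bucket <- quotes$ts_ms %/% 60000

## Basic per-bucket statistics: mean and standard deviation of price
stats <- aggregate(price ~ bucket, data = quotes,
                   FUN = function(x) c(mean = mean(x), sd = sd(x)))

dbDisconnect(con)

Which side ends up faster will depend on indexes and data volume, but the
general rule of thumb in the replies holds: reduce the data in the database,
then do the statistics on the (much smaller) result in R.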