On Sat, Feb 16, 2013 at 2:32 PM, Hadley Wickham <h.wick...@gmail.com> wrote:
> > The subset table isn't a copy of the subset; it contains the unique key
> > and an indicator column showing whether the element is in the subset. I
> > need this even if the subset is never modified, so that I can join it to
> > the main table and use it in SQL 'where' conditions to get computations
> > for the right subset of the data.
>
> Cool - Is that faster than storing a column that just contains the
> include indices?

I haven't tried -- I'm writing this so it doesn't modify database tables,
partly for safety and partly to reduce the privileges it needs to run.

> > The whole point of this new sqlsurvey package is that most of the
> > aggregation operations happen in the database rather than in R, which is
> > faster for very large data tables. The use case is things like the
> > American Community Survey and the Nationwide Emergency Department
> > Subsample, with millions or tens of millions of records and quite a lot
> > of variables. At this scale, loading stuff into memory isn't feasible on
> > commodity desktops and laptops, and even on computers with enough memory,
> > the database (MonetDB) is faster.
>
> Have you done any comparisons of monetdb vs sqlite - I'm interested to
> know how much faster it is. I'm working on a package
> (https://github.com/hadley/dplyr) that compiles R data manipulation
> expressions into (e.g.) SQL, and have been wondering if it's worth
> considering a column-store like monetdb.

It's enormously faster than SQLite for databases slightly larger than
physical memory. I don't have measurements, but I started this project
using SQLite and it just wasn't fast enough to be worthwhile. My guess is
that for the JOIN and SELECT SUM() ... GROUP BY operations I'm using it's
perhaps ten times faster.

For moderate-sized databases it's competitive with analysis in memory even
if you ignore the data-loading time. For example, using a data set already
in memory, with 18000 records and 96 variables:

> system.time(svymean(~BPXSAR+BPXDAR,subset(dhanes,RIAGENDR==2)))
   user  system elapsed
   0.09    0.01    0.10

Using MonetDB:

> system.time(svymean(~bpxsar+bpxdar,se=TRUE,subset(sqhanes,riagendr==2)))
   user  system elapsed
  0.020   0.001   0.108

   -thomas

--
Thomas Lumley
Professor of Biostatistics
University of Auckland
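For concreteness, a minimal sketch of the subset-table idea quoted above,
using an in-memory SQLite database through DBI to stand in for the MonetDB
connection sqlsurvey actually uses; the table and column names (records,
subtab, caseid, stratum, and so on) are invented for illustration, not
taken from the package:

library(DBI)
library(RSQLite)

## In-memory SQLite stands in for MonetDB here; all names are made up.
con <- dbConnect(SQLite(), ":memory:")

## Toy "main" table: one row per record, with a unique key,
## an analysis variable, a stratum, and a gender code.
records <- data.frame(
  caseid   = 1:6,
  bpxsar   = c(110, 125, 130, 118, 140, 122),
  stratum  = c(1, 1, 2, 2, 3, 3),
  riagendr = c(1, 2, 2, 1, 2, 1)
)
dbWriteTable(con, "records", records)

## The subset table is not a copy of the data: just the key plus a
## 0/1 indicator, so the main table is never modified.
subtab <- data.frame(
  caseid   = records$caseid,
  insubset = as.integer(records$riagendr == 2)
)
dbWriteTable(con, "subtab", subtab)

## Aggregation stays in the database: join on the key, use the
## indicator in the WHERE clause, and let SQL do SUM() ... GROUP BY.
dbGetQuery(con, "
  SELECT r.stratum, SUM(r.bpxsar) AS total, COUNT(*) AS n
  FROM records AS r JOIN subtab AS s ON r.caseid = s.caseid
  WHERE s.insubset = 1
  GROUP BY r.stratum
")

dbDisconnect(con)

Either way, the main table is only ever read, which is the property the
message above is after: no modification of database tables, and fewer
privileges needed to run.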