> The subset table isn't a copy of the subset, it contains the unique key
> and an indicator column showing whether the element is in the subset. I
> need this even if the subset is never modified, so that I can join it to
> the main table and use it in SQL 'where' conditions to get computations
> for the right subset of the data.
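For concreteness, a minimal sketch of that layout (not sqlsurvey's actual
code): the names main_tbl, subset_tbl, idkey, and in_subset are
hypothetical, and SQLite stands in for MonetDB only to keep the example
self-contained.

    library(DBI)

    con <- dbConnect(RSQLite::SQLite(), ":memory:")

    ## the main data table, keyed by a unique id
    dbWriteTable(con, "main_tbl",
                 data.frame(idkey = 1:5, x = c(2, 4, 6, 8, 10)))

    ## the "subset" table: just the key plus a 0/1 indicator,
    ## not a copy of the subset's rows
    dbWriteTable(con, "subset_tbl",
                 data.frame(idkey = 1:5, in_subset = c(1L, 0L, 1L, 1L, 0L)))

    ## aggregate over the subset inside the database: join on the key
    ## and restrict with a WHERE condition on the indicator
    dbGetQuery(con, "
      SELECT AVG(m.x) AS mean_x
      FROM main_tbl m
      JOIN subset_tbl s ON m.idkey = s.idkey
      WHERE s.in_subset = 1
    ")

    dbDisconnect(con)

Only the key and a one-bit flag are stored per element, and the same
join/WHERE pattern restricts any aggregation to the subset without ever
copying rows, whether or not the subset is modified.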
Cool - is that faster than storing a column that just contains the include
indices?

> The whole point of this new sqlsurvey package is that most of the
> aggregation operations happen in the database rather than in R, which is
> faster for very large data tables. The use case is things like the
> American Community Survey and the Nationwide Emergency Department
> Subsample, with millions or tens of millions of records and quite a lot
> of variables. At this scale, loading stuff into memory isn't feasible on
> commodity desktops and laptops, and even on computers with enough memory,
> the database (MonetDB) is faster.

Have you done any comparisons of MonetDB vs SQLite? I'm interested to know
how much faster it is. I'm working on a package
(https://github.com/hadley/dplyr) that compiles R data manipulation
expressions into other backends (e.g. SQL), and I have been wondering
whether it's worth considering a column-store like MonetDB.

Hadley

--
Chief Scientist, RStudio
http://had.co.nz/
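For illustration, a sketch of the kind of translation being described,
written against the dplyr/dbplyr API as it was eventually released (an
assumption relative to this thread); the table and column names are made
up, and SQLite again keeps the example self-contained.

    library(dplyr)

    con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
    DBI::dbWriteTable(con, "main_tbl",
                      data.frame(g = c("a", "a", "b"), x = c(1, 2, 3)))

    ## a lazy reference to the database table; nothing is pulled into R yet
    main <- tbl(con, "main_tbl")

    q <- main %>%
      group_by(g) %>%
      summarise(mean_x = mean(x, na.rm = TRUE))

    show_query(q)  # the SQL that the R expression compiles to
    collect(q)     # execute in the database and bring the result into R

    DBI::dbDisconnect(con)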