Some possibilities using existing tools: if you create a file connection and open it before reading from it (or writing to it), then functions like read.table and read.csv (and write.table for a writable connection) will read from or write to the connection without closing and resetting it. This means that you could open two files, one for reading and one for writing, then read in a chunk, process it, write it out, read in the next chunk, and so on; a sketch follows below.
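Something along these lines should work (an untested sketch; the file
names, the 10000-row chunk size, and the pass-through processing step
are placeholders):

infile  <- file("big.csv", open = "r")     # stays open between reads
outfile <- file("results.csv", open = "w") # stays open between writes

## read the header line once; later reads then start at the data
col.names <- strsplit(readLines(infile, n = 1), ",")[[1]]
writeLines(paste(col.names, collapse = ","), outfile)

repeat {
    ## read.csv picks up where the previous read left off;
    ## it throws an error once the connection is exhausted
    chunk <- tryCatch(read.csv(infile, header = FALSE, nrows = 10000,
                               col.names = col.names),
                      error = function(e) NULL)
    if (is.null(chunk)) break
    ## ... process 'chunk' here ...
    write.table(chunk, outfile, sep = ",", row.names = FALSE,
                col.names = FALSE)
}
close(infile)
close(outfile)

The key point is that the connections are opened once, up front, so each
read.csv call continues from where the previous one stopped instead of
starting over at the top of the file.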
Another option would be to read the data into an ff object (using the ff package) or into a database (SQLite, for one), from which the data could then be accessed in chunks, possibly even in parallel; see the sketch after the quoted message below.

On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.we...@anderson.ucla.edu> wrote:

> dear R wizards---
>
> I presume this is a common problem, so I thought I would ask whether
> this solution already exists and if not, suggest it. say a user has
> a data set of x GB, where x is very big---say, greater than RAM.
> fortunately, data often come sequentially in groups, and there is a
> need to process contiguous subsets of them and write the results to a
> new file. read.csv and write.csv only work on FULL data sets.
> read.csv has the ability to skip n lines and read only m lines, but
> this can cross the subsets. the useful solution here would be a
> "filter" function that understands about chunks:
>
> filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>
> a chunk would not exactly be a factor, because normal R factors can
> be non-sequential in the data frame. filter.csv makes it very simple
> to work on large data sets...almost SAS simple:
>
> filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>   function(d) colMeans(d) )
>
> or
>
> filter.csv( pipe('bzcat infile.csv.bz2'),
>   pipe("bzip -c > results.csv.bz2"), "date",
>   function(d) d[ unique(d$date), ] )  ## filter out observations that
>                                       ## have the same date again later
>
> or some reasonable variant of this.
>
> now that I can have many small chunks, it would be nice if this were
> thread-safe, so
>
> mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>
> with 'library(parallel)' could feed multiple cores the FUNprocess and
> make sure that the processes don't step on one another. (why did R
> not use a dot after "mc" for parallel lapply?) presumably, to keep it
> simple, mcfilter.csv would keep a counter of read chunks and block
> write chunks until the next sequential chunk in order arrives.
>
> just a suggestion...
>
> /iaw
>
> ----
> Ivo Welch (ivo.we...@gmail.com)

--
Gregory (Greg) L. Snow Ph.D.
538...@gmail.com
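For the database route, the DBI and RSQLite packages let you pull the
rows back a block at a time. A rough sketch (untested; the file name,
the table name "mydata", and the chunk size are invented for
illustration):

library(DBI)   # with the RSQLite package installed
con <- dbConnect(RSQLite::SQLite(), "big.db")

## assume the csv has already been loaded into a table called "mydata",
## e.g. appended chunk by chunk with dbWriteTable(..., append = TRUE)
res <- dbSendQuery(con, "SELECT * FROM mydata ORDER BY date")
while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = 10000)   # next 10000 rows
    ## ... process 'chunk' here ...
}
dbClearResult(res)
dbDisconnect(con)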
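For what it's worth, a bare-bones filter.csv along the lines Ivo
sketches above could be built from the same open-connection trick. This
is an untested sketch; it assumes the chunk column arrives already
grouped in the file (as he stipulates), and the 10000-row read size is
arbitrary:

filter.csv <- function(in.csv, out.csv, chunk, FUNprocess) {
    ## accept either file names or unopened connections (e.g. pipe())
    if (!inherits(in.csv, "connection")) in.csv <- file(in.csv)
    if (!inherits(out.csv, "connection")) out.csv <- file(out.csv)
    open(in.csv, "r")
    open(out.csv, "w")
    on.exit({ close(in.csv); close(out.csv) })

    col.names <- strsplit(readLines(in.csv, n = 1), ",")[[1]]
    buffer <- NULL   # rows whose chunk value may still be incomplete
    first <- TRUE    # write column names only for the first result

    emit <- function(d) {
        write.table(FUNprocess(d), out.csv, sep = ",",
                    row.names = FALSE, col.names = first)
        first <<- FALSE
    }

    repeat {
        block <- tryCatch(read.csv(in.csv, header = FALSE, nrows = 10000,
                                   col.names = col.names,
                                   stringsAsFactors = FALSE),
                          error = function(e) NULL)  # NULL at end of input
        if (is.null(block)) break
        buffer <- rbind(buffer, block)
        vals <- unique(buffer[[chunk]])
        ## every chunk value except the last one seen is complete
        for (v in vals[-length(vals)])
            emit(buffer[buffer[[chunk]] == v, , drop = FALSE])
        buffer <- buffer[buffer[[chunk]] == vals[length(vals)], ,
                         drop = FALSE]
    }
    if (!is.null(buffer) && nrow(buffer) > 0) emit(buffer)  # last chunk
    invisible(NULL)
}

Something like Ivo's first example, filter.csv(pipe('bzcat
infile.csv.bz2'), "results.csv", "date", function(d) colMeans(d)),
should then run, give or take column types. A parallel mcfilter.csv
would need the ordered-write bookkeeping he describes on top of this.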