I have a data set like this in one .txt file (columns separated by !):

APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
It contains over 14,000,000 records. Because I run out of memory when trying to handle this data in R, I am trying to read it sequentially, write it out to several .csv files (or .RData files), and then read those into R one by one. One record spans the lines between two consecutive GG!KK!KK! markers, i.e. each record ends at a GG!KK!KK! line. I tried to implement Jim Holtman's approach (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html), but the problem is how to avoid cutting a record in the middle: if I set nrows = 1000000, I have no way of knowing whether some record ends up split across two files. How can I avoid that? My code so far:

zz <- file("myfile.txt", "r")
fileNo <- 1
repeat {
  gotError <- 1  # set to 2 if there is an error
  # catch the error when there is no more data to read
  tryCatch(input <- read.csv(zz, as.is = TRUE, nrows = 1000000, sep = "!",
                             row.names = NULL, na.strings = "", header = FALSE),
           error = function(x) gotError <<- 2)
  if (gotError == 2) break
  # save the intermediate chunk
  save(input, file = sprintf("file%03d.RData", fileNo))
  fileNo <- fileNo + 1
}
close(zz)
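One idea I am considering, sketched below but untested: read the file in blocks of lines with readLines() instead of read.csv(), cut each block at its last GG!KK!KK! marker, and carry the trailing partial record over into the next block. The block size (100000 lines) and the output name pattern (chunk%03d.RData) are placeholders I made up:

zz <- file("myfile.txt", "r")
fileNo <- 1
leftover <- character(0)  # partial record carried over from the previous block
repeat {
  block <- readLines(zz, n = 100000)
  lines <- c(leftover, block)
  if (length(lines) == 0) break                 # nothing left to process
  marks <- which(lines == "GG!KK!KK!")
  if (length(block) == 0) {
    complete <- lines                           # end of file: flush what remains
    leftover <- character(0)
  } else if (length(marks) == 0) {
    leftover <- lines                           # no record boundary yet: read on
    next
  } else {
    last <- marks[length(marks)]
    complete <- lines[seq_len(last)]            # whole records only
    leftover <- lines[-seq_len(last)]           # partial record for the next block
  }
  tc <- textConnection(complete)
  input <- read.csv(tc, sep = "!", header = FALSE, as.is = TRUE,
                    na.strings = "", row.names = NULL)
  close(tc)
  save(input, file = sprintf("chunk%03d.RData", fileNo))
  fileNo <- fileNo + 1
}
close(zz)

If that logic is right, every saved chunk ends exactly on a GG!KK!KK! line, so no record is split and the chunks can be loaded one by one with load(). Does that look sound, or is there a cleaner way?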