I have a data set like this in one .txt file (columns separated by !):

APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK!

It contains over 14 000 000 records. Because I run out of memory when
trying to handle this data in R, I'm trying to read it sequentially,
write it out as several .csv files (or .RData files), and then read
those into R one by one. One record in this data runs from one
GG!KK!KK! line to the next. I tried to implement Jim Holtman's
approach (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html),
but the problem is how to avoid cutting a record in the middle: if I
set nrows = 1000000, a record (the lines between two GG!KK!KK! marks)
may end up split across two files. How can I avoid that? My code so
far:

zz <- file("myfile.txt", "r")
fileNo <- 1
repeat {

    gotError <- 1  # set to 2 if there is an error, i.e. no more data to read
    tryCatch(input <- read.csv(zz, as.is = TRUE, nrows = 1000000, sep = "!",
                               row.names = NULL, na.strings = "", header = FALSE),
             error = function(x) gotError <<- 2)

    if (gotError == 2) break
    # save the intermediate data
    save(input, file = sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
}
close(zz)
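
One way I can think of to avoid the split is to read the file in
chunks of lines with readLines() instead of read.csv(), write out only
the lines up to the last GG!KK!KK! marker in each chunk, and carry the
remaining lines over into the next chunk. A minimal, untested sketch,
assuming every record ends with a literal GG!KK!KK! line (the chunk
size of 1000000 lines is as arbitrary as above):

con <- file("myfile.txt", "r")
leftover <- character(0)   # lines after the last complete record, carried forward
fileNo <- 1
repeat {
    raw <- readLines(con, n = 1000000)
    atEOF <- length(raw) < 1000000     # a short read means end of file
    lines <- c(leftover, raw)
    if (length(lines) == 0) break      # nothing left at all

    marks <- which(lines == "GG!KK!KK!")
    if (length(marks) == 0 && !atEOF) {
        leftover <- lines              # no record boundary yet; read more
        next
    }

    # at end of file flush everything, otherwise cut at the last marker
    cut <- if (atEOF) length(lines) else marks[length(marks)]
    if (cut < length(lines)) {
        leftover <- lines[(cut + 1):length(lines)]
    } else {
        leftover <- character(0)
    }

    # parse only the complete records, with the same options as above
    tc <- textConnection(lines[seq_len(cut)])
    input <- read.csv(tc, sep = "!", header = FALSE, as.is = TRUE,
                      na.strings = "", row.names = NULL)
    close(tc)

    save(input, file = sprintf("file%03d.RData", fileNo))
    fileNo <- fileNo + 1
    if (atEOF) break
}
close(con)

This way each file%03d.RData ends exactly at a record boundary, since
everything after the last marker is held back until the next pass.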
