Thanks Jim, I tried to adapt this solution to my situation (a .txt file as input):

zz <- file("myfile.txt", "r")
fileNo <- 1  # used for file name
buffer <- NULL
repeat{
   input <- read.csv(zz, as.is=T, nrows=1000000, sep='!', row.names=NULL, na.strings="")
   if (length(input) == 0) break  # done
   buffer <- c(buffer, input)
   # find separator
   repeat{
       indx <- which(grepl("^GG!KK!KK!", buffer))[1]
       if (is.na(indx)) break  # not found yet; read more
       writeLines(buffer[1:(indx - 1L)]
           , sprintf("newFile%04d.txt", fileNo)
           )
       buffer <- buffer[-c(1:indx)]  # remove data
       fileNo <- fileNo + 1
   }
}

but it gives me an error

Error in read.table(file = file, header = header, sep = sep, quote = quote,  :
  no lines available in input
>

Do you know a reason for this?

-J

2011/10/18 jim holtman <jholt...@gmail.com>:
> Let's do it in two parts: first create all the separate files (which
> if this what you are after, we can stop here).  You can change the
> value on readLines to read in as many lines as you want; I set it to 2
> just for testing.
>
> x <- textConnection("APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!
> APE!KKU!684!
> APE!VAL!!
> APE!UASU!!
> APE!PLA!1!
> APE!E!10!
> APE!TPVA!17122009!
> APE!STAP!1!
> GG!KK!KK!")
>
> fileNo <- 1  # used for file name
> buffer <- NULL
> repeat{
>    input <- readLines(x, n = 100)
>    if (length(input) == 0) break  # done
>    buffer <- c(buffer, input)
>    # find separator
>    repeat{
>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>        if (is.na(indx)) break  # not found yet; read more
>        writeLines(buffer[1:(indx - 1L)]
>            , sprintf("newFile%04d", fileNo)
>            )
>        buffer <- buffer[-c(1:indx)]  # remove data
>        fileNo <- fileNo + 1
>    }
> }
>
>
> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesr...@gmail.com> wrote:
>> I have a data set like this in one .txt file (cols separated by !):
>>
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>> APE!KKU!684!
>> APE!VAL!!
>> APE!UASU!!
>> APE!PLA!1!
>> APE!E!10!
>> APE!TPVA!17122009!
>> APE!STAP!1!
>> GG!KK!KK!
>>
>> it contains over 14 000 000 records. Now because I'm out of memory
>> when trying to handle this data in R, I'm trying to read it
>> sequentially and write it out in several .csv files (or .RData files)
>> and then read these into R one-by-one. One record in this data is
>> between lines GG!KK!KK! and GG!KK!KK!. I tried to implement Jim
>> Holtman's approach
>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
>> problem is how to avoid cutting one record from the middle? I mean
>> that if I put nrows = 1000000, I don't know if one record (between
>> marks GG!KK!KK! and GG!KK!KK! is ending up in two files). How to avoid
>> that? My code so far:
>>
>> zz <- file("myfile.txt", "r")
>> fileNo <- 1
>> repeat{
>>
>>    gotError <- 1  # set to 2 if there is an error
>>    # catch the error if not more data
>>    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                               row.names=NULL, na.strings="", header=FALSE),
>>             error=function(x) gotError <<- 2)
>>
>>    if (gotError == 2) break
>>    # save the intermediate data
>>    save(input, file=sprintf("file%03d.RData", fileNo))
>>    fileNo <- fileNo + 1
>> }
>> close(zz)
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
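
A note on the error reported at the top of this message, offered as a likely explanation rather than a confirmed diagnosis: read.csv() returns a data frame, so length(input) counts columns and in practice does not reach 0. Once the connection is exhausted, read.csv() stops with "no lines available in input" instead of returning an empty result. The buffer logic also expects a character vector of raw lines, which is what readLines() returns. Below is a minimal sketch of the same loop using readLines() on a file connection. The function name split_records and its arguments (out_dir, sep_pattern, chunk) are hypothetical names introduced here for illustration. It also uses seq_len() in place of 1:(indx - 1L), so that a separator on the first buffered line gives an empty record instead of indexing backwards.

```r
# Hypothetical helper (not from the thread): write one output file per record,
# where a record is every line up to, but not including, a separator line.
split_records <- function(path, out_dir, sep_pattern = "^GG!KK!KK!",
                          chunk = 1000000L) {
  con <- file(path, "r")
  on.exit(close(con))                  # close the connection even on error
  file_no <- 1L
  buffer <- character(0)
  repeat {
    input <- readLines(con, n = chunk) # character vector; length 0 at end of file
    if (length(input) == 0) break      # done
    buffer <- c(buffer, input)
    # write out every complete record currently in the buffer
    repeat {
      indx <- which(grepl(sep_pattern, buffer))[1]
      if (is.na(indx)) break           # no separator yet; read more
      writeLines(buffer[seq_len(indx - 1L)],
                 file.path(out_dir, sprintf("newFile%04d.txt", file_no)))
      buffer <- buffer[-seq_len(indx)] # drop the record and its separator
      file_no <- file_no + 1L
    }
  }
  file_no - 1L                         # number of files written
}
```

Called as split_records("myfile.txt", "."), this would write newFile0001.txt, newFile0002.txt, and so on into the working directory. Any lines after the last separator stay in the buffer and are not written, which matches the behaviour of the loop quoted above.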