Thanks, Jim, for your help. I tried this code using readLines and it works, but not in the way I wanted. It seems that this code writes every record to its own text file, so I end up with over 14 000 000 text files. My intention is to get only 15 text files, all except one containing about 1 000 000 rows, so that the record at the breakpoint (near line 1 000 000) is not cut in the middle...
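The behaviour described above (chunks of roughly 1 000 000 lines that only ever end on a record boundary) could be sketched as follows. This is only a sketch under assumptions, not Jim's code: the function name split_at_records, its arguments chunk_lines and read_size, and the chunk%04d.txt output names are all made up for illustration. Only the "GG!KK!KK!" separator line comes from the data in this thread. Unlike the quoted solution, the separator line is kept as the last line of each record.

```r
# Sketch: split a big text file into files of roughly `chunk_lines` lines,
# cutting only right after a separator line so no record is split in two.
# All names here (split_at_records, chunk_lines, read_size, chunk%04d.txt)
# are hypothetical; only the "GG!KK!KK!" separator is taken from the thread.
split_at_records <- function(path, out_dir = ".", chunk_lines = 1000000L,
                             read_size = 100000L,
                             sep_pattern = "^GG!KK!KK!") {
  con <- file(path, "r")
  on.exit(close(con))

  buffer  <- character(0)   # lines read but not yet written
  file_no <- 1L             # running number for output file names
  written <- character(0)   # paths of the files created so far

  write_chunk <- function(lines) {
    out <- file.path(out_dir, sprintf("chunk%04d.txt", file_no))
    writeLines(lines, out)
    written <<- c(written, out)
    file_no <<- file_no + 1L
  }

  repeat {
    input <- readLines(con, n = read_size)
    if (length(input) == 0L) break            # end of input
    buffer <- c(buffer, input)

    # Flush while the buffer holds at least one full-size chunk.
    while (length(buffer) >= chunk_lines) {
      seps <- grep(sep_pattern, buffer)
      # First separator at or after the target size: this is the end of
      # the record that straddles the breakpoint.
      cut <- seps[seps >= chunk_lines][1]
      if (is.na(cut)) break                   # record incomplete; read more
      write_chunk(buffer[seq_len(cut)])
      buffer <- buffer[-seq_len(cut)]
    }
  }

  if (length(buffer) > 0L) write_chunk(buffer)  # last, shorter file
  invisible(written)
}

# Hypothetical usage:
# files <- split_at_records("myfile.txt", chunk_lines = 1000000L)
```

Each chunk is slightly longer than chunk_lines, because it runs on to the end of the record that crosses the breakpoint. The resulting files could then be read one at a time, for example with read.table(f, sep = "!", header = FALSE, as.is = TRUE, na.strings = "").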
-J

2011/10/18 jim holtman <jholt...@gmail.com>:
> Use 'readLines' instead of 'read.table'. We want to read in the text
> file and convert it into separate text files, each of which can then
> be read in using 'read.table'. My solution assumes that you have used
> readLines. Trying to do this with data frames gets messy. Keep it
> simple and do it in two phases; makes it easier to debug and to see
> what is going on.
>
>
>
> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesr...@gmail.com> wrote:
>> Thanks Jim,
>>
>> I tried to convert this solution into my situation (.txt file as an input);
>>
>> zz <- file("myfile.txt", "r")
>>
>> fileNo <- 1  # used for file name
>> buffer <- NULL
>> repeat{
>>    input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>                      row.names=NULL, na.strings="")
>>    if (length(input) == 0) break  # done
>>    buffer <- c(buffer, input)
>>    # find separator
>>    repeat{
>>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>        if (is.na(indx)) break  # not found yet; read more
>>        writeLines(buffer[1:(indx - 1L)]
>>            , sprintf("newFile%04d.txt", fileNo)
>>            )
>>        buffer <- buffer[-c(1:indx)]  # remove data
>>        fileNo <- fileNo + 1
>>    }
>> }
>>
>> but it gives me an error
>>
>> Error in read.table(file = file, header = header, sep = sep, quote = quote,
>>  :
>>  no lines available in input
>>>
>>
>> Do you know a reason for this?
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholt...@gmail.com>:
>>> Let's do it in two parts: first create all the separate files (which
>>> if this what you are after, we can stop here). You can change the
>>> value on readLines to read in as many lines as you want; I set it to 2
>>> just for testing.
>>>
>>> x <- textConnection("APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!
>>> APE!KKU!684!
>>> APE!VAL!!
>>> APE!UASU!!
>>> APE!PLA!1!
>>> APE!E!10!
>>> APE!TPVA!17122009!
>>> APE!STAP!1!
>>> GG!KK!KK!")
>>>
>>> fileNo <- 1  # used for file name
>>> buffer <- NULL
>>> repeat{
>>>    input <- readLines(x, n = 100)
>>>    if (length(input) == 0) break  # done
>>>    buffer <- c(buffer, input)
>>>    # find separator
>>>    repeat{
>>>        indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>        if (is.na(indx)) break  # not found yet; read more
>>>        writeLines(buffer[1:(indx - 1L)]
>>>            , sprintf("newFile%04d", fileNo)
>>>            )
>>>        buffer <- buffer[-c(1:indx)]  # remove data
>>>        fileNo <- fileNo + 1
>>>    }
>>> }
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesr...@gmail.com>
>>> wrote:
>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>> APE!KKU!684!
>>>> APE!VAL!!
>>>> APE!UASU!!
>>>> APE!PLA!1!
>>>> APE!E!10!
>>>> APE!TPVA!17122009!
>>>> APE!STAP!1!
>>>> GG!KK!KK!
>>>>
>>>> it contains over 14 000 000 records. Now because I'm out of memory
>>>> when trying to handle this data in R, I'm trying to read it
>>>> sequentially and write it out in several .csv files (or .RData files)
>>>> and then read these into R one-by-one. One record in this data is
>>>> between lines GG!KK!KK! and GG!KK!KK!. I tried to implement Jim
>>>> Holtman's approach
>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
>>>> problem is how to avoid cutting one record from the middle? I mean
>>>> that if I put nrows = 1000000, I don't know if one record (between
>>>> marks GG!KK!KK! and GG!KK!KK! is ending up in two files). How to avoid
>>>> that?
>>>> My code so far:
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>> fileNo <- 1
>>>> repeat{
>>>>
>>>>    gotError <- 1 # set to 2 if there is an error
>>>>    # catch the error if not more data
>>>>    tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>             row.names=NULL, na.strings="", header=FALSE),
>>>>             error=function(x) gotError <<- 2)
>>>>
>>>>    if (gotError == 2) break
>>>>    # save the intermediate data
>>>>    save(input, file=sprintf("file%03d.RData", fileNo))
>>>>    fileNo <- fileNo + 1
>>>> }
>>>> close(zz)
>>>>
>>>> ______________________________________________
>>>> R-help@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Data Munger Guru
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
>