Thank you Jim for your kind reply. My intention was to split one 14M-line file
into fewer than 15 text files, each of them having ~1M lines. The idea was to
make sure that one "sequence"

GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end

does not get broken into parts between those files, so that e.g. the end of
the first file (containing ~1M lines) has

...
GG!KK!KK! --sequence start
APE!KKU!684!
APE!VAL!!
APE!UASU!!
--no sequence end here!

and the beginning of the second file has

--no sequence start here!
APE!PLA!1!
APE!E!10!
APE!TPVA!17122009!
APE!STAP!1!
GG!KK!KK! --sequence end
...

-J

2011/10/18 jim holtman <jholt...@gmail.com>:
> I thought that you wanted a separate file for each of the breaks
> "GG!KK!KK!". If you want to read in some large number of lines and
> then break them so that they have that many lines, you can do the same
> thing, except scanning from the back for a break. So if your input
> file has 14M breaks in it, then the code I sent would create that many
> files. If you want a minimum number of lines per file, including the
> breaks, then it can be done. You just have to be clearer on exactly
> what the requirements are. From your sample data, it looks like there
> were 7 text lines per record, so if your input was 14M lines, I would
> expect that you would have something in the neighborhood of 1.8M files
> with 7 lines each. If you had 14M lines in the file and you were
> generating 14M files, then there is something wrong with your code in
> that it is not recognizing the breaks. How many lines did each file
> have in it?
>
> On Tue, Oct 18, 2011 at 9:36 AM, johannes rara <johannesr...@gmail.com> wrote:
>> Thanks Jim for your help. I tried this code using readLines and it
>> works, but not in the way I wanted. It seems that this code is trying to
>> separate all records from a text file, so that I'm getting over 14 000
>> 000 text files. My intention is to get only 15 text files, all except
>> one containing 1 000 000 rows, so that the record which is on the
>> breakpoint (near line 1 000 000) is not cut in the "middle"...
>>
>> -J
>>
>> 2011/10/18 jim holtman <jholt...@gmail.com>:
>>> Use 'readLines' instead of 'read.table'. We want to read in the text
>>> file and convert it into separate text files, each of which can then
>>> be read in using 'read.table'. My solution assumes that you have used
>>> readLines. Trying to do this with data frames gets messy. Keep it
>>> simple and do it in two phases; makes it easier to debug and to see
>>> what is going on.
>>>
>>>
>>>
>>> On Tue, Oct 18, 2011 at 8:57 AM, johannes rara <johannesr...@gmail.com>
>>> wrote:
>>>> Thanks Jim,
>>>>
>>>> I tried to convert this solution into my situation (.txt file as an input);
>>>>
>>>> zz <- file("myfile.txt", "r")
>>>>
>>>> fileNo <- 1  # used for file name
>>>> buffer <- NULL
>>>> repeat{
>>>>     input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>                       row.names=NULL, na.strings="")
>>>>     if (length(input) == 0) break  # done
>>>>     buffer <- c(buffer, input)
>>>>     # find separator
>>>>     repeat{
>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>         writeLines(buffer[1:(indx - 1L)]
>>>>             , sprintf("newFile%04d.txt", fileNo)
>>>>             )
>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>         fileNo <- fileNo + 1
>>>>     }
>>>> }
>>>>
>>>> but it gives me an error
>>>>
>>>> Error in read.table(file = file, header = header, sep = sep, quote =
>>>> quote, :
>>>>   no lines available in input
>>>>>
>>>>
>>>> Do you know a reason for this?
>>>>
>>>> -J
>>>>
>>>> 2011/10/18 jim holtman <jholt...@gmail.com>:
>>>>> Let's do it in two parts: first create all the separate files (which
>>>>> if this is what you are after, we can stop here). You can change the
>>>>> value on readLines to read in as many lines as you want; I set it to 2
>>>>> just for testing.
>>>>>
>>>>> x <- textConnection("APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!
>>>>> APE!KKU!684!
>>>>> APE!VAL!!
>>>>> APE!UASU!!
>>>>> APE!PLA!1!
>>>>> APE!E!10!
>>>>> APE!TPVA!17122009!
>>>>> APE!STAP!1!
>>>>> GG!KK!KK!")
>>>>>
>>>>> fileNo <- 1  # used for file name
>>>>> buffer <- NULL
>>>>> repeat{
>>>>>     input <- readLines(x, n = 100)
>>>>>     if (length(input) == 0) break  # done
>>>>>     buffer <- c(buffer, input)
>>>>>     # find separator
>>>>>     repeat{
>>>>>         indx <- which(grepl("^GG!KK!KK!", buffer))[1]
>>>>>         if (is.na(indx)) break  # not found yet; read more
>>>>>         writeLines(buffer[1:(indx - 1L)]
>>>>>             , sprintf("newFile%04d", fileNo)
>>>>>             )
>>>>>         buffer <- buffer[-c(1:indx)]  # remove data
>>>>>         fileNo <- fileNo + 1
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> On Tue, Oct 18, 2011 at 8:12 AM, johannes rara <johannesr...@gmail.com>
>>>>> wrote:
>>>>>> I have a data set like this in one .txt file (cols separated by !):
>>>>>>
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>> APE!KKU!684!
>>>>>> APE!VAL!!
>>>>>> APE!UASU!!
>>>>>> APE!PLA!1!
>>>>>> APE!E!10!
>>>>>> APE!TPVA!17122009!
>>>>>> APE!STAP!1!
>>>>>> GG!KK!KK!
>>>>>>
>>>>>> It contains over 14 000 000 records. Now, because I'm out of memory
>>>>>> when trying to handle this data in R, I'm trying to read it
>>>>>> sequentially and write it out in several .csv files (or .RData files)
>>>>>> and then read these into R one-by-one. One record in this data is
>>>>>> between lines GG!KK!KK! and GG!KK!KK!. I tried to implement Jim
>>>>>> Holtman's approach
>>>>>> (http://tolstoy.newcastle.edu.au/R/e6/help/09/03/8416.html) but the
>>>>>> problem is how to avoid cutting one record in the middle? I mean
>>>>>> that if I put nrows = 1000000, I don't know if one record (between
>>>>>> marks GG!KK!KK! and GG!KK!KK!) is ending up in two files. How to avoid
>>>>>> that? My code so far:
>>>>>>
>>>>>> zz <- file("myfile.txt", "r")
>>>>>> fileNo <- 1
>>>>>> repeat{
>>>>>>
>>>>>>     gotError <- 1  # set to 2 if there is an error
>>>>>>     # catch the error if no more data
>>>>>>     tryCatch(input <- read.csv(zz, as.is=T, nrows=1000000, sep='!',
>>>>>>                  row.names=NULL, na.strings="", header=FALSE),
>>>>>>              error=function(x) gotError <<- 2)
>>>>>>
>>>>>>     if (gotError == 2) break
>>>>>>     # save the intermediate data
>>>>>>     save(input, file=sprintf("file%03d.RData", fileNo))
>>>>>>     fileNo <- fileNo + 1
>>>>>> }
>>>>>> close(zz)
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help@r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide
>>>>>> http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Jim Holtman
>>>>> Data Munger Guru
>>>>>
>>>>> What is the problem that you are trying to solve?
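
For what it's worth, the "scan from the back for a break" idea mentioned in
the thread could look roughly like the sketch below. This is not code from
the thread: the function name split_on_breaks, its arguments, and the
part%03d.txt file names are made up for illustration. It assumes that every
record ends with a line starting with GG!KK!KK! and that cutting right after
such a line is acceptable, so each output file ends with a break line.

```r
# Hypothetical helper (not from the thread): split `infile` into pieces of
# roughly `chunk_lines` lines, cutting only right after a separator line.
split_on_breaks <- function(infile, chunk_lines = 1000000L,
                            sep_prefix = "GG!KK!KK!",
                            out_pattern = "part%03d.txt") {
  con <- file(infile, "r")
  on.exit(close(con))
  file_no <- 1L
  buffer <- character(0)   # lines read but not yet written
  written <- character(0)  # names of the files created so far
  repeat {
    input <- readLines(con, n = chunk_lines)
    if (length(input) == 0L) break            # end of input
    buffer <- c(buffer, input)
    # scan from the back: use the LAST separator currently in the buffer
    breaks <- which(startsWith(buffer, sep_prefix))
    if (length(breaks) == 0L) next            # no break yet; read more
    cut <- breaks[length(breaks)]
    out <- sprintf(out_pattern, file_no)
    writeLines(buffer[seq_len(cut)], out)
    written <- c(written, out)
    buffer <- buffer[-seq_len(cut)]           # carry the incomplete record
    file_no <- file_no + 1L
  }
  # lines after the final separator (if any) go to one last file
  if (length(buffer) > 0L) {
    out <- sprintf(out_pattern, file_no)
    writeLines(buffer, out)
    written <- c(written, out)
  }
  invisible(written)
}

# Small demo on the three sample records from the thread, 20 lines per read
record <- c("APE!KKU!684!", "APE!VAL!!", "APE!UASU!!", "APE!PLA!1!",
            "APE!E!10!", "APE!TPVA!17122009!", "APE!STAP!1!", "GG!KK!KK!")
demo_in <- tempfile(fileext = ".txt")
writeLines(rep(record, 3), demo_in)
parts <- split_on_breaks(demo_in, chunk_lines = 20L,
                         out_pattern = file.path(tempdir(), "part%03d.txt"))
```

With chunk_lines = 1000000 on a 14M-line input, this should give about 15
files of at most roughly 1M lines each, none of which ends in the middle of
a record. Using readLines here also sidesteps the "no lines available in
input" error: readLines returns a zero-length vector at the end of the
input, whereas read.csv raises an error when nothing is left on the
connection.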