Here is how long it might take to do this in R. I created a 558MB file, read it in, found the lines that had '76095' in them, and then wrote those out:

> system.time(x <- readLines('tempyy'))  # read in the 558MB file
   user  system elapsed
  65.91    0.82   67.40
> object.size(x)
63348864 bytes
> str(x)  # 14M lines of data
 chr [1:14276304] "\"Locationname\",\"N Units\",\"Wskusku\"" ...
> system.time(indx <- grepl("76095", x))  # grep for the criteria
   user  system elapsed
  10.78    0.02   11.46
> system.time(writeLines(x[indx], 'tempzz'))  # write the 1152 matching lines
   user  system elapsed
   0.13    0.03    0.23
> sum(indx)
[1] 1152
>

On Wed, Sep 14, 2011 at 10:06 AM, Rainer Schuermann <rainer.schuerm...@gmx.net> wrote:
> That looks like a perfect job for (g)awk which is in every Linux distribution
> but also available for Windows.
> It can be called with something like
>
>   system( "awk -f script.awk inputfile.txt" )
>
> and does its job silently and very fast. 650MB should not be an issue. I'm not
> proficient in awk but would offer my help anyway (off-list...).
>
> Rgds,
> Rainer
>
>
> On Wednesday 14 September 2011 13:08:14 Stefan McKinnon Høj-Edwards wrote:
>> Dear R-help,
>>
>> I have a very large ascii data file, of which I only want to read in
>> selected lines (e.g. on fourth of the lines); determining which lines
>> depends on the lines content. So far, I have found two approaches for doing
>> this in R; 1) Read the file line by line using a repeat-loop and save the
>> result in a temporary file or a variable, and 2) Read the entire file and
>> filter/reshape it using *apply methods. To my understanding, the use of
>> repeat{}-loops are quite slow in R, and reading an entire file to discard 3
>> quarters of the data is a bit of an overkill. Not to mention loading an
>> 650MB text file into memory.
>>
>> What I am looking for is a function, that works like the first approach, but
>> avoiding do- or repeat-loops, so I imagine it is implemented in a
>> lower-level language, to be more efficient. Naturally, when calling the
>> function, one would provide a function that determines if/how the line
>> should be appended to a variable.
>> Alternatively, an object working as an
>> generator (in Python terms), could be used with the normal *apply
>> functions. I imagine this working differently from e.g.
>> sapply(readLines("myfile.txt"), FUN=selector), in that "readLines" would be
>> executed first, loading the entire file into memory and supplying it to
>> sapply, whereas the generator-object only reads a line when sapply requests
>> the next element.
>>
>> Are there options for this kind of operation?
>>
>> Kind regards,
>>
>> Stefan McKinnon Høj-Edwards     Dept. of Genetics and Biotechnology
>> PhD student                     Faculty of Agricultural Sciences
>> stefan.hoj-edwa...@agrsci.dk    Aarhus University
>> Tel.: +45 8999 1291             Blichers Allé 20, Postboks 50
>> Web: www.iysik.com              DK-8830 Tjele
>>                                 Tel.: +45 8999 1900
>>                                 Web: www.agrsci.au.dk
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>

--
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
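
P.S. On the original question of not holding the whole file in memory: one possible middle ground is to read through an open connection in chunks, so the repeat loop runs once per chunk rather than once per line. The following is only a minimal sketch, not something timed in this thread. The helper name filter_lines and the chunk_size default are made up for illustration, and it assumes a plain substring match (fixed = TRUE) is what is wanted.

```r
# Hypothetical helper: filter a large text file in chunks so that at most
# `chunk_size` lines are held in memory at any one time.
filter_lines <- function(infile, outfile, pattern, chunk_size = 100000L) {
  con_in <- file(infile, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)

  n_matched <- 0L
  repeat {
    # readLines on an open connection continues where the last call stopped
    lines <- readLines(con_in, n = chunk_size)
    if (length(lines) == 0L) break            # end of file reached
    keep <- grepl(pattern, lines, fixed = TRUE)
    writeLines(lines[keep], con_out)          # append matches to the output
    n_matched <- n_matched + sum(keep)
  }
  n_matched                                   # number of lines written
}
```

With the file names used above it would be called as filter_lines('tempyy', 'tempzz', '76095'). A larger chunk_size means fewer loop iterations at the cost of more memory per chunk.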