Re: [R] How to separate huge dataset into chunks

Thomas Lumley Thu, 26 Mar 2009 00:37:53 -0700

On Wed, 25 Mar 2009, Guillaume Filteau wrote:

Hello Thomas,
Thanks for your help!
Sadly your code does not work for the last chunk, because its length is shorter than nrows.


You just need to move the test to the bottom of the loop

      repeat{
         chunk <- read.table(conn, nrows=10000, col.names=nms)
         ## do something to the chunk
         ## test nrow(), not length(): length() of a data frame counts columns
         if(nrow(chunk) < 10000) break
      }
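
For completeness, here is a minimal self-contained sketch of the same idea. It is not from the original thread: the function name process_in_chunks and its arguments are hypothetical. It assumes a whitespace-delimited file with a header line, and it assumes that a read.table() error after a full chunk means the row count was an exact multiple of the chunk size (the "no lines available in input" case), so the error is treated as end of file.

```r
## Hypothetical helper (not from the thread): read a delimited text file in
## chunks of `chunk_size` rows and apply `fun` to each chunk in turn.
process_in_chunks <- function(path, chunk_size = 10000, fun, ...) {
  conn <- file(path, open = "r")
  on.exit(close(conn))

  ## The first read consumes the header line and fixes the column names.
  chunk <- read.table(conn, header = TRUE, nrows = chunk_size, ...)
  nms <- names(chunk)
  results <- list()

  repeat {
    ## Process the current chunk, including the first and the last one.
    results <- c(results, list(fun(chunk)))

    ## A short chunk means the end of the file has been reached.
    if (nrow(chunk) < chunk_size) break

    ## Assumption: any error here is taken to mean "no lines left", which
    ## happens when the row count is an exact multiple of chunk_size.
    chunk <- tryCatch(
      read.table(conn, nrows = chunk_size, col.names = nms, ...),
      error = function(e) NULL
    )
    if (is.null(chunk)) break
  }
  results
}
```

Something like process_in_chunks("mybigfile", fun = nrow) would then return one row count per chunk. Swallowing every error as end of file is a shortcut; a stricter version would inspect the error message before giving up.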

Quoting Thomas Lumley <tlum...@u.washington.edu>:
On Tue, 24 Mar 2009, Guillaume Filteau wrote:
Hello all,
I'm trying to take a huge dataset (1.5 GB) and separate it into smaller chunks with R.
So far I have had nothing but problems.
I cannot load the whole dataset in R due to memory problems. So, I instead try to load a few (100000) lines at a time (with read.table).
However, R kept crashing (with no error message) at about line 6800000. This is extremely frustrating.
To try to fix this, I used connections with read.table. However, I now get a cryptic error telling me no lines available in input.
Is there any way to make this work?
There might be an error in line 42 of your script. Or somewhere else. The error message is cryptically saying that there were no lines of text available in the input connection, so presumably the connection wasn't pointed at your file correctly.
It's hard to guess without seeing what you are doing, but
   conn <- file("mybigfile", open="r")
   chunk <- read.table(conn, header=TRUE, nrows=10000)
   nms <- names(chunk)
   ## do something to the first chunk here as well
   ## nrow() counts rows; length() of a data frame counts columns
   while(nrow(chunk)==10000){
      chunk <- read.table(conn, nrows=10000, col.names=nms)
      ## do something to the chunk
   }
   close(conn)

should work. This sort of thing certainly does work routinely.
It's probably not worth reading 100,000 lines at a time unless your computer has a lot of memory. Reducing the chunk size to 10,000 shouldn't introduce much extra overhead and may well increase the speed by reducing memory use.
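
If the goal is only to cut the file into smaller files, parsing with read.table may not be needed at all. The following is a rough sketch under that assumption, not something from the original thread: split_file, lines_per_file and prefix are hypothetical names. It copies raw lines with readLines() and writeLines(), repeats the header line in every piece, and assumes no record spans more than one line (for example, no quoted fields containing newlines).

```r
## Hypothetical helper (not from the thread): split a text file with one
## header line into numbered files of at most `lines_per_file` data lines.
split_file <- function(path, lines_per_file = 10000, prefix = "chunk") {
  conn <- file(path, open = "r")
  on.exit(close(conn))

  ## Keep the header so that each piece can be read on its own later.
  header <- readLines(conn, n = 1)
  out_files <- character(0)

  repeat {
    lines <- readLines(conn, n = lines_per_file)
    ## readLines() returns a zero-length vector at end of file.
    if (length(lines) == 0) break

    out <- sprintf("%s_%04d.txt", prefix, length(out_files) + 1L)
    writeLines(c(header, lines), out)
    out_files <- c(out_files, out)
  }
  out_files
}
```

Only one piece is held in memory at a time, so the piece size can be chosen to suit the machine; each output file can then be loaded separately with read.table(file, header=TRUE).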
    -thomas

Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle
