There are several approaches to analyzing data sets much larger than memory, 
and the best approach depends on the problem. It's certainly possible to 
process gigabytes of data on a 32-bit R system; the examples I've worked on 
are whole-genome association studies with 10^5-10^6 variables and 10^3-10^4 
observations. Other people have worked with much larger data sets.

Some approaches are:

- incremental reading using a connection to the file, reading a few thousand 
lines at a time.  The statistics Edwin wants can all be computed in a single 
pass through the data. This is what the biglm package does for linear models. 
(A one-pass sketch follows this list.)

- storing the data in a relational database and then either
   *) using SQL commands (mean, min, and max are all built into SQL) to do
       most of the work and just reading results (or interim results) into R
   *) reading appropriate chunks of the data into R and doing the computations
       there
  (An SQLite sketch follows this list.)

- storing the data in netCDF or HDF5 formats and loading chunks into R.  These 
are less flexible than relational databases but more efficient for certain 
sorts of subsets. (A netCDF sketch follows this list.)

- memory-mapping the data file (the ff package does this) to read sections of 
data. I haven't tried this, so I'm not sure where its advantages and 
disadvantages lie. (An ff sketch follows this list.)
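
To make the incremental-reading approach concrete, here is a minimal one-pass 
sketch. The file name "big.txt", the whitespace-delimited layout with a header 
line, and the numeric column "x" are assumptions for illustration:

con <- file("big.txt", open = "r")
header <- strsplit(readLines(con, n = 1), "[ \t]+")[[1]]  # column names
n <- 0; s <- 0; lo <- Inf; hi <- -Inf
repeat {
  lines <- readLines(con, n = 10000)        # a few thousand lines at a time
  if (length(lines) == 0) break             # end of file
  tc <- textConnection(lines)
  chunk <- read.table(tc, col.names = header)
  close(tc)
  x  <- chunk$x                             # the (assumed numeric) column
  n  <- n + length(x)
  s  <- s + sum(x)
  lo <- min(lo, x)
  hi <- max(hi, x)
}
close(con)
c(mean = s / n, min = lo, max = hi)

Only one chunk is ever in memory, so the file size is limited by disk space, 
not RAM.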
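
For the database route, SQLite is convenient because it runs in-process with 
no server. A sketch using the DBI/RSQLite interface; the database file 
"big.sqlite", the table name "mytable", and the column "x" are made up for 
illustration:

library(RSQLite)
con <- dbConnect(SQLite(), dbname = "big.sqlite")
## let the database do the arithmetic and return only three numbers
dbGetQuery(con, "SELECT AVG(x), MIN(x), MAX(x) FROM mytable")
## or pull a manageable chunk into R and compute there
chunk <- dbGetQuery(con, "SELECT * FROM mytable LIMIT 100000")
dbDisconnect(con)

Loading the text file into the table is a one-time cost (for example with the 
sqlite3 command-line .import, or dbWriteTable on chunks).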
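
For netCDF, reading a slice looks roughly like this (a sketch assuming the 
ncdf4 package and a one-dimensional variable called "x" in the file; the older 
ncdf package has equivalent functions):

library(ncdf4)
nc <- nc_open("big.nc")
## read only observations 1..100000 of variable "x", not the whole file
x <- ncvar_get(nc, "x", start = 1, count = 100000)
nc_close(nc)
summary(x)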
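
For ff, which as I said I haven't tried, the sketch below is my understanding 
of the basic pattern (assuming a comma-separated file "big.csv" with a header): 
read.csv.ffdf parses the file in chunks and keeps the columns on disk, so the 
resulting object uses little RAM.

library(ff)
dat <- read.csv.ffdf(file = "big.csv", header = TRUE)
dim(dat)        # the usual data.frame-style interface
x <- dat$x[]    # pulls one column into RAM; only do this if it fits
mean(x)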


The bigmemory package does not address quite the same problem.  It deals with 
objects that fit in memory but are large enough that copying them is a bad 
idea, and it also handles sharing between processes. (A short sketch follows.)
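
A minimal bigmemory sketch, in case it is useful anyway (the file name and the 
column "x" are invented, and the file must be numeric-only, since a big.matrix 
holds a single type):

library(bigmemory)
## read a numeric-only file into a big.matrix; the object is not copied when
## passed around and can be attached from other R processes via describe()
X <- read.big.matrix("big_numeric.csv", sep = ",", header = TRUE,
                     type = "double")
mean(X[, "x"])  # extracting a column brings just that column into RAM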

      -thomas





On Tue, 6 Jan 2009, Simon Pickett wrote:

Hi,

I am not very knowledgeable about this kind of thing, but my guess is that if you have a fairly slow computer and massive data sets, there isn't a lot you can do except get a better computer, buy more RAM, or use something like SAS instead.

Hopefully someone else will chip in, Edwin. Best of luck.

Simon.


----- Original Message -----
From: "Edwin Sendjaja" <edw...@web.de>
To: "Simon Pickett" <simon.pick...@bto.org>
Cc: <r-help@r-project.org>
Sent: Tuesday, January 06, 2009 2:53 PM
Subject: Re: [R] Large Dataset


Hi Simon,

My RAM is only 3.2 GB (actually it should be 4 GB, but my motherboard doesn't
support it).

R uses almost all of my RAM and half of my swap. I think memory.limit will not
solve my problem. It seems that I need more RAM.

Unfortunately, I can't buy more RAM.

Why is R slow at reading big data sets?


Edwin

Only a couple of weeks ago I had to deal with this.

Adjust the memory limit as follows, although you might not want 4000; that
is quite high:

memory.limit(size = 4000)

Simon.

----- Original Message -----
From: "Edwin Sendjaja" <edw...@web.de>
To: "Simon Pickett" <simon.pick...@bto.org>
Cc: <r-help@r-project.org>
Sent: Tuesday, January 06, 2009 12:24 PM
Subject: Re: [R] Large Dataset

> Hi Simon,
>
> Thanks for your reply.
> I have read ?Memory, but I don't understand how to use it. I am not sure if
> that can solve my problem. Can you give me more detail?
>
> Thanks,
>
> Edwin
>
>> type
>>
>> ?memory
>>
>> into R and that will explain what to do...
>>
>> S
>> ----- Original Message -----
>> From: "Edwin Sendjaja" <edw...@web.de>
>> To: <r-help@r-project.org>
>> Sent: Tuesday, January 06, 2009 11:41 AM
>> Subject: [R] Large Dataset
>>
>> > Hi all,
>> >
>> > I have a 3.1 GB data set (with 11 columns and lots of data as int and
>> > string).
>> > If I use read.table, it takes very long. It seems that my RAM is not big
>> > enough (it gets overloaded). I have 3.2 GB RAM and 7 GB swap, 64-bit
>> > Ubuntu.
>> >
>> > Is there a good solution for reading large data into R? I have seen that
>> > people suggest using the bigmemory package or ff, but it seems very
>> > complicated, and I don't know how to start with those packages.
>> >
>> > I have tried to use bigmemory, but I got some kind of errors. Then I
>> > gave up.
>> >
>> >
>> > Can someone give me a simple example of how to use ff or bigmemory? Or
>> > maybe a better solution?
>> >
>> >
>> >
>> > Thank you in advance,
>> >
>> >
>> > Edwin
>> >






Thomas Lumley                   Assoc. Professor, Biostatistics
tlum...@u.washington.edu        University of Washington, Seattle

