Re: [R] What is the best package for large data cleaning (not statistical analysis)?

Sean Zhang Sun, 15 Mar 2009 05:15:51 -0700

Dear Jim:

Thanks for your reply.
It looks to me like you were using batching.
I have used batching to digest large data in Matlab before.
I am still wondering about the answers to my two specific questions,
without resorting to batching.
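
For reference, here is how I understand the batching idea, as a minimal base-R sketch. The helper name clean_in_chunks, its arguments, the comma-separated input with a header line, and the toy "drop rows with NA" cleaning step are all assumptions made for illustration; this is not Jim's actual code. It writes each cleaned block back out to a text file and leaves out the filehash storage step he mentions.

```r
# Hypothetical helper: clean a large CSV file one block at a time so that
# only 'chunk_size' rows are held in memory at once.
# Assumes a non-empty, comma-separated file whose first line is a header.
clean_in_chunks <- function(infile, outfile, clean_fun, chunk_size = 1e6) {
  con <- file(infile, open = "r")
  on.exit(close(con))

  # Keep the header line so every block is parsed with the same column names.
  header_line <- readLines(con, n = 1)
  first <- TRUE
  total <- 0

  repeat {
    lines <- readLines(con, n = chunk_size)
    if (length(lines) == 0) break          # end of file reached

    # Parse this block only, then apply the user-supplied cleaning step.
    chunk <- read.csv(text = c(header_line, lines), stringsAsFactors = FALSE)
    chunk <- clean_fun(chunk)

    # Write the header with the first block, append the later blocks.
    write.table(chunk, outfile, sep = ",", row.names = FALSE,
                col.names = first, append = !first)
    first <- FALSE
    total <- total + nrow(chunk)
  }
  invisible(total)                         # number of rows written
}

# Toy usage: 10 rows, two of them with a missing x, processed 3 rows at a time.
tmp_in  <- tempfile(fileext = ".csv")
tmp_out <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = c(1, NA, 3, 4, NA, 6, 7, 8, 9, 10)),
          tmp_in, row.names = FALSE)

n_kept <- clean_in_chunks(tmp_in, tmp_out,
                          clean_fun = function(d) d[!is.na(d$x), ],
                          chunk_size = 3)
result <- read.csv(tmp_out)
```

This works for row-wise cleaning, but a merge or a long-to-wide reshape needs rows from different blocks at the same time, which is why I am asking about ff and filehash.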


Thanks.

-Sean




On Sat, Mar 14, 2009 at 10:13 PM, jim holtman <jholt...@gmail.com> wrote:

> Exactly what type of cleaning do you want to do on them?  Can you read
> in the data a block at a time (e.g., 1M records), clean them up and
> then write them back out?  You would have the choice of putting them
> back as a text file or possibly storing them using 'filehash'.  I have
> used that technique to segment a year's worth of data that was
> probably 3GB of text into monthly objects that were about 70MB
> dataframes that I stored using filehash.  These I then read back in to
> do processing where I could summarize by month.  So it all depends on
> what you want to do.
>
> You could read in the chunks, clean them and then reshape them into
> dataframes that you could process later.  You will still probably have
> the problem that all the data still won't fit in memory.  Now one
> thing I did was that since the dataframes were stored as binary
> objects in filehash, it was pretty fast to retrieve them, pick out the
> data I needed from each month and create a subset of just the data I
> needed that would now fit in memory.
>
> So it all depends ...........
>
> On Sat, Mar 14, 2009 at 8:46 PM, Sean Zhang <seane...@gmail.com> wrote:
> > Dear R helpers:
> >
> > I am a newbie to R and have a question related to cleaning large data
> > frames in R.
> >
> > So far, I have been using SAS for data cleaning because my data sets are
> > relatively large (multiple files, each of which can be as large as
> > 5-10 GB). I am not a fan of SAS at all and am eager to move my data
> > cleaning tasks into R completely.
> >
> > It seems to me there are three options: SQL, ff, or filehash. I do not
> > want to learn SQL, so my question is about ff and filehash.
> >
> > Specifically:
> >
> > (1) For merging two large data frames, which is better: ff or filehash?
> > (2) For reshaping a large data frame (say, from long to wide or the
> > opposite), which is better: ff or filehash?
> >
> > If you can provide examples, that will be even better.
> >
> > Many thanks in advance.
> >
> > -Sean
> >
> > ______________________________________________
> > R-help@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?
>

