Re: [R] Performance tuning tips when working with wide datasets

Claudia Beleites Wed, 24 Nov 2010 04:27:31 -0800

Dear Richard,

Does anyone have any performance tuning tips when working with datasets that
are extremely wide (e.g. 20,000 columns)?

The obvious one is: use matrices – and take care that they don't get converted
back to data.frames.

In particular, I am trying to perform a merge like below:

merged_data <- merge(data1, data2,
                     by.x = "date", by.y = "date", all = TRUE, sort = TRUE)

This statement takes about 8 hours to execute on a pretty fast machine.  The
dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25
columns.  The dataset data2 contains annual data (only 60 observations),
however there are lots of columns (20,000 of them).

I have to do a lot of these kinds of merges so need to figure out a way to
speed it up.

I have tried a number of different things to speed things up, to no avail.
I've noticed that rbinds execute much faster using matrices than data frames.
However, the performance improvement when using matrices (vs. data frames) on
merges was negligible (8 hours down to 7).

Even that gain is astonishing, as merge() on a matrix uses merge.default, which
boils down to merge(as.data.frame(x), as.data.frame(y), ...), i.e. the matrices
are converted to data.frames anyway.
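
A quick way to see this on toy data. This is only a small sketch: the
matrices m1 and m2 and their column names are made up for illustration and
are not Richard's data.

```r
# Two small numeric matrices sharing a key column "date" (toy values).
m1 <- cbind(date = 1:4, a = c(10, 20, 30, 40))
m2 <- cbind(date = c(2, 4), b = c(0.5, 0.7))

# merge() on matrices goes through merge.default, which converts both
# arguments with as.data.frame() before doing the actual merge.
res <- merge(m1, m2, by = "date", all = TRUE)

class(res)      # a data.frame comes back, not a matrix
is.matrix(res)  # FALSE
```

So any time saved by holding the inputs as matrices is lost again inside
merge().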

I tried casting my merge field
(date) into various different data types (character, factor, date).  This
didn't seem to have any effect. I tried the hash package; however, merge
couldn't coerce the class into a data.frame.  I've tried various ways to
parallelize computation in the past, and found that to be problematic for a
variety of reasons (runaway forked processes, doesn't run in a GUI
environment, doesn't run on Macs, etc.).

I'm starting to run out of ideas, anyone?  Merging a 60 row dataset shouldn't
take that long.


Do I understand correctly that the result should be a 20000 x 20025 matrix,
where the additional 20000 columns are from data2 and end up in the rows of
e.g. every 1st of January?

In that case, it may be much faster to create tmp <- matrix (NA, 20000, 20000),
fill the values of data2 into the correct rows, and then cbind data1 and tmp.
Make sure you have enough RAM available: tmp is about 1.5 GB as a logical NA
matrix, and about 3 GB once it holds doubles. If you manage to do this without
swapping, it should be reasonably fast.
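
The steps above could be sketched roughly as follows, on deliberately tiny
toy data. The names daily, annual, idx and hit are mine, and I assume that
both inputs are numeric matrices whose date is already encoded as a number;
the real data will need its own date handling.

```r
# Toy stand-ins: 'daily' plays the role of data1 (many rows, few columns),
# 'annual' the role of data2 (few rows, many columns). Dates are encoded
# as numbers (YYYYMMDD) purely for illustration.
daily <- cbind(date = c(20100101, 20100102, 20100103, 20110101, 20110102),
               x    = c(1.1, 1.2, 1.3, 2.1, 2.2))
annual <- cbind(date = c(20100101, 20110101),
                v1   = c(100, 200),
                v2   = c(-1, -2))

# Preallocate the block that will hold the annual columns. NA_real_ makes
# it a double matrix from the start, so filling it does not force a
# conversion of a logical NA matrix.
tmp <- matrix(NA_real_, nrow = nrow(daily), ncol = ncol(annual) - 1,
              dimnames = list(NULL, colnames(annual)[-1]))

# For each annual row, find the matching daily row (NA if there is none).
idx <- match(annual[, "date"], daily[, "date"])
hit <- !is.na(idx)

# Fill the matching rows in one vectorised assignment.
tmp[idx[hit], ] <- annual[hit, -1, drop = FALSE]

# cbind of two matrices stays a matrix; no data.frame is involved.
merged <- cbind(daily, tmp)
```

Note that this only covers rows of data2 whose date also occurs in data1.
With all=TRUE, merge would additionally append unmatched data2 rows; those
would have to be rbind-ed on separately.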

If you end up writing a proper merge function for matrices, please let me know:
I'd be interested in using it...

Claudia

Thanks,
Richard

______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



--
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Università degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: [email protected]
