On Wed, 24 Nov 2010, Richard Vlasimsky wrote:
> Does anyone have any performance tuning tips when working with
> datasets that are extremely wide (e.g. 20,000 columns)?
Don't use data frames.
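For example, a minimal sketch of the matrix alternative (the names and
dimensions here are illustrative, and it assumes all 20,000 annual series
are numeric):

  # keep the wide annual data as a numeric matrix, one row per year
  annual <- matrix(rnorm(61 * 20000), nrow = 61,
                   dimnames = list(1950:2010, paste0("V", 1:20000)))
  # rows can then be looked up by year via character indexing -- no merge
  annual[as.character(c(1955, 2001)), 1:5]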
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x = "date", by.y = "date",
>                      all = TRUE, sort = TRUE)
> This statement takes about 8 hours to execute on a pretty fast
> machine. The dataset data1 contains daily data going back to 1950
> (20,000 rows) and has 25 columns. The dataset data2 contains annual
> data (only 60 observations), but there are lots of columns (20,000 of
> them).
>
> I have to do a lot of these kinds of merges, so I need to figure out
> a way to speed them up.
>
> I have tried a number of different things to speed this up, to no
> avail. I've noticed that rbinds execute much faster on matrices than
> on data frames. However, the performance improvement when using
> matrices (vs. data frames) for merges was negligible (8 hours down
> to 7). I tried casting my merge field (date) to various data types
> (character, factor, date); this didn't seem to have any effect. I
> tried the hash package, but merge couldn't coerce the class into a
> data.frame. I've tried various ways to parallelize computation in
> the past and found that to be problematic for a variety of reasons
> (runaway forked processes, doesn't run in a GUI environment, doesn't
> run on Macs, etc.).
>
> I'm starting to run out of ideas. Anyone? Merging a 60-row dataset
> shouldn't take that long.
Correct, but that is not what you said you are doing.
Why do you need to do the merge at all? It sounds like you simply need
to create a suitable index into the rows of the 20,000-column data
frame and use that appropriately, e.g. merge data1 with a data2
containing the year and a column index = 1:60.
date <- seq(from = as.Date("1950-01-01"), to = Sys.Date(), by = 1)
year <- 1900 + as.POSIXlt(date)$year
data1 <- data.frame(date, year, matrix(rnorm(length(date) * 25), ncol = 25))
fdata2 <- data.frame(year = 1950:2010, index = 1:61)
system.time(m <- merge(data1, fdata2, by = "year", all = TRUE, sort = TRUE))
   user  system elapsed
  0.226   0.011   0.237
and use m$index to index the real data2.
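For instance (a sketch, assuming data2 is the real 61 x 20,000 annual
data frame with rows in the same year order as fdata2):

  # pull only the daily rows and annual columns actually needed, rather
  # than materialising the full ~22,000 x 20,000 merged result
  rows <- which(m$date >= as.Date("2000-01-01") &
                m$date <= as.Date("2000-12-31"))
  data2[m$index[rows], 1:10]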
We don't have an example, but at the very least you are ending up with
a data frame of 20,000 rows and ca. 20,000 columns. That's not a small
object.
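For a rough sense of scale (back-of-the-envelope, numeric columns at 8
bytes per double):

  20000 * 20000 * 8 / 1024^3   # ~2.98 GiB, before merge makes any copies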
> Thanks,
> Richard
--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.