On Wed, 24 Nov 2010, Richard Vlasimsky wrote:

Does anyone have any performance tuning tips when working with datasets that are extremely wide (e.g. 20,000 columns)?

Don't use data frames.
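
For a rough sense of why (a minimal sketch, with made-up sizes; X, D and i here are illustrative only): row subscripting a data frame dispatches through [.data.frame, which works column by column at R level, so the same operation is far cheaper on a matrix:

## illustrative only: the same wide data held as a matrix and as a data frame
X <- matrix(rnorm(60 * 20000), nrow = 60)   # 60 rows x 20,000 columns
D <- as.data.frame(X)
i <- sample(60, 1000, replace = TRUE)       # expand the 60 rows to 1,000
system.time(X[i, ])   # matrix subscript: typically milliseconds
system.time(D[i, ])   # data-frame subscript: typically far slower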

In particular, I am trying to perform a merge like the one below:

merged_data <- merge(data1, data2, by.x = "date", by.y = "date", all = TRUE, sort = TRUE)

This statement takes about 8 hours to execute on a pretty fast machine. The dataset data1 contains daily data going back to 1950 (20,000 rows) and has 25 columns. The dataset data2 contains annual data (only 60 observations), however there are lots of columns (20,000 of them).

I have to do a lot of these kinds of merges, so I need to figure out a way to speed them up.

I have tried a number of things to speed this up, to no avail. I've noticed that rbind() executes much faster on matrices than on data frames, but the improvement from using matrices for the merge itself was negligible (8 hours down to 7). I tried casting my merge field (date) to various types (character, factor, Date); this didn't seem to have any effect. I tried the hash package, but merge() couldn't coerce that class to a data.frame. I've tried various ways to parallelize computation in the past and found that problematic for a variety of reasons (runaway forked processes, doesn't run in a GUI environment, doesn't run on Macs, etc.).

I'm starting to run out of ideas; anyone? Merging a 60-row dataset shouldn't take that long.

Correct, but that is not what you said you are doing.

Why do you need to do the merge at all? It sounds like you simply need to create a suitable index into the rows of the 20,000-column data frame and use that appropriately. E.g. merge data1 with a data2 containing the year and a column index = 1:60.

## daily dates from 1950 onwards, plus a year key and 25 columns of dummy data
date <- seq(from = as.Date("1950-01-01"), to = Sys.Date(), by = 1)
year <- 1900 + as.POSIXlt(date)$year
data1 <- data.frame(date, year, matrix(rnorm(length(date) * 25), ncol = 25))
## small stand-in for data2: just the year and a row index
fdata2 <- data.frame(year = 1950:2010, index = 1:61)

system.time(m <- merge(data1, fdata2, by='year', all=TRUE, sort=TRUE))
   user  system elapsed
  0.226   0.011   0.237

and use m$index to index the real data2.
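
For instance (a minimal sketch; data2m is a made-up matrix standing in for the real 61 x 20,000 annual data):

## hypothetical stand-in for the real annual data, one row per year
data2m <- matrix(rnorm(61 * 20000), nrow = 61)
## expand the annual rows out to daily rows via the merged index;
## this replaces the 20,000-column merge entirely
daily_wide <- data2m[m$index, ]
dim(daily_wide)   # one row per date, 20,000 columns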

We don't have a reproducible example, but at the least you are ending up with a data frame of ca. 20,000 rows and ca. 20,000 columns. That's not a small object: 20,000 x 20,000 doubles alone is about 3 GB, before any copies merge() makes.
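
As a side note (again a sketch, reusing the objects above), even the small merge can be skipped: match() builds the same per-day index directly:

## the per-day row index without any merge(): look up each day's year
## in the annual table
idx <- fdata2$index[match(data1$year, fdata2$year)]
## idx[k] is the row of the annual data that applies to row k of data1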

Thanks,
Richard

--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
