On Wed, 24 Nov 2010, Richard Vlasimsky wrote:
> Does anyone have any performance tuning tips when working with
> datasets that are extremely wide (e.g. 20,000 columns)?
Don't use data frames.
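For example, a minimal sketch of the matrix alternative (the names and
dimensions here are illustrative, and it assumes all 20,000 annual series
are numeric):

  # keep the wide annual data as a numeric matrix, one row per year
  annual <- matrix(rnorm(61 * 20000), nrow = 61,
                   dimnames = list(1950:2010, paste0("V", 1:20000)))
  # rows can then be looked up by year via character indexing -- no merge
  annual[as.character(c(1955, 2001)), 1:5]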
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2, by.x = "date", by.y = "date",
>                      all = TRUE, sort = TRUE)
> This statement takes about 8 hours to execute on a pretty fast
> machine. The dataset data1 contains daily data going back to 1950
> (20,000 rows) and has 25 columns. The dataset data2 contains annual
> data (only 60 observations), but there are lots of columns (20,000 of
> them).
>
> I have to do a lot of these kinds of merges, so I need to figure out
> a way to speed them up.
>
> I have tried a number of different things to speed this up, to no
> avail. I've noticed that rbinds execute much faster on matrices than
> on data frames. However, the performance improvement when using
> matrices (vs. data frames) for merges was negligible (8 hours down
> to 7). I tried casting my merge field (date) to various data types
> (character, factor, date); this didn't seem to have any effect. I
> tried the hash package, but merge couldn't coerce the class into a
> data.frame. I've tried various ways to parallelize computation in
> the past and found that to be problematic for a variety of reasons
> (runaway forked processes, doesn't run in a GUI environment, doesn't
> run on Macs, etc.).
>
> I'm starting to run out of ideas. Anyone? Merging a 60-row dataset
> shouldn't take that long.
Correct, but that is not what you said you are doing.
Why do you need to do the merge at all? It sounds like you simply need
to create a suitable index into the rows of the 20,000-column data
frame and use that appropriately, e.g. merge data1 with a data2
containing the year and a column index = 1:60.
date <- seq(from = as.Date("1950-01-01"), to = Sys.Date(), by = 1)
year <- 1900 + as.POSIXlt(date)$year
data1 <- data.frame(date, year, matrix(rnorm(length(date) * 25), ncol = 25))
fdata2 <- data.frame(year = 1950:2010, index = 1:61)
system.time(m <- merge(data1, fdata2, by = "year", all = TRUE, sort = TRUE))
   user  system elapsed
  0.226   0.011   0.237
and use m$index to index the real data2.
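For instance (a sketch, assuming data2 is the real 61 x 20,000 annual
data frame with rows in the same year order as fdata2):

  # pull only the daily rows and annual columns actually needed, rather
  # than materialising the full ~22,000 x 20,000 merged result
  rows <- which(m$date >= as.Date("2000-01-01") &
                m$date <= as.Date("2000-12-31"))
  data2[m$index[rows], 1:10]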
We don't have an example, but at the very least you are ending up with
a data frame of 20,000 rows and ca. 20,000 columns. That's not a small
object.
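For a rough sense of scale (back-of-the-envelope, numeric columns at 8
bytes per double):

  20000 * 20000 * 8 / 1024^3   # ~2.98 GiB, before merge makes any copies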
> Thanks,
> Richard
--
Brian D. Ripley,                  rip...@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel: +44 1865 272861 (self)
1 South Parks Road,                    +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax: +44 1865 272595
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.