Richard,

Try data.table. See the introduction vignette and the presentations; e.g., there is a slide showing a join to 183,000,000 observations of daily stock prices in 0.002 seconds.
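For illustration, here is a minimal sketch of a keyed join. The table and column names (daily, annual, px, y) are invented stand-ins, not your data; only data.table(), setkey() and the X[Y] join syntax are data.table itself.

library(data.table)

# hypothetical stand-ins for your two tables
daily  <- data.table(date = seq(as.Date("1950-01-01"), by = "day",  length.out = 20000),
                     px   = rnorm(20000))
annual <- data.table(date = seq(as.Date("1950-01-01"), by = "year", length.out = 60),
                     y    = rnorm(60))

# key both tables on the join column; keyed joins use binary search
# rather than re-sorting or hashing on every merge
setkey(daily,  date)
setkey(annual, date)

# keyed join: one row per row of 'daily'; 'y' is NA on dates with no annual observation
merged <- annual[daily]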
data.table has fast rolling joins (i.e. fast last-observation-carried-forward) too; I see you asked about that on this list on 8 Nov. Also see the fast aggregations using 'by' on a key()-ed in-memory table.

I wonder whether your 20,000 columns are always populated for all rows. If not, consider collapsing to a 3-column table (row, col, data) and then joining to that. Your original data source may already be in that format, in which case you could skip the step that expands it to wide. In other words, keeping it narrow may be an option, much as a sparse matrix is stored; see the sketch after the quoted message below.

Matthew

http://datatable.r-forge.r-project.org/

"Richard Vlasimsky" <richard.vlasim...@imidex.com> wrote in message
news:2e042129-4430-4c66-9308-a36b761eb...@imidex.com...
>
> Does anyone have any performance tuning tips when working with datasets
> that are extremely wide (e.g. 20,000 columns)?
>
> In particular, I am trying to perform a merge like below:
>
> merged_data <- merge(data1, data2,
>                      by.x = "date", by.y = "date", all = TRUE, sort = TRUE)
>
> This statement takes about 8 hours to execute on a pretty fast machine.
> The dataset data1 contains daily data going back to 1950 (20,000 rows) and
> has 25 columns. The dataset data2 contains annual data (only 60
> observations), however there are lots of columns (20,000 of them).
>
> I have to do a lot of these kinds of merges, so I need to figure out a way
> to speed them up.
>
> I have tried a number of different things to speed things up, to no avail.
> I've noticed that rbinds execute much faster using matrices than
> data frames. However, the performance improvement when using matrices
> (vs. data frames) on merges was negligible (8 hours down to 7). I tried
> casting my merge field (date) into various different data types
> (character, factor, date); this didn't seem to have any effect. I tried
> the hash package, however merge couldn't coerce that class into a
> data.frame. I've tried various ways to parallelize computation in the
> past, and found that to be problematic for a variety of reasons (runaway
> forked processes, doesn't run in a GUI environment, doesn't run on Macs,
> etc.).
>
> I'm starting to run out of ideas, anyone? Merging a 60-row dataset
> shouldn't take that long.
>
> Thanks,
> Richard
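The sketch of the narrow (row, col, data) idea mentioned above, with invented table and column names (annual_wide, series_a, value); melt() is the usual reshaping function (reshape2, and later data.table itself), shown only to illustrate the collapse, not your actual pipeline.

library(data.table)

# hypothetical wide annual table: 60 rows, a few columns standing in for 20,000
annual_wide <- data.table(date     = seq(as.Date("1950-01-01"), by = "year", length.out = 60),
                          series_a = rnorm(60),
                          series_b = rnorm(60),
                          series_c = rnorm(60))

# collapse to the 3-column (row, col, data) form; na.rm = TRUE drops unpopulated
# cells, which is where the sparse-matrix-style saving comes from
annual_long <- melt(annual_wide, id.vars = "date",
                    variable.name = "series", value.name = "value", na.rm = TRUE)

daily <- data.table(date = seq(as.Date("1950-01-01"), by = "day", length.out = 20000),
                    px   = rnorm(20000))

setkey(annual_long, date)
setkey(daily, date)

# attach the daily observation to each (date, series, value) cell of the narrow table
result <- daily[annual_long]
# a rolling join, daily[annual_long, roll = TRUE], would instead carry the last
# daily value forward where an annual date has no exact daily match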