Your sample code does not run, and you didn't show the format of your inputs. If your 'ttoevP' is a table without duplicate Start/End pairs that maps each Start/End pair to OEV.T, then using match() and subscripting can be quicker than merge().
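The idiom in its simplest form, on a toy one-column lookup (the data here are made up just to show the pattern, not taken from your post):

    ## toy lookup table: each key appears exactly once
    lookup <- data.frame(key = c("a", "b", "c"), value = 1:3,
                         stringsAsFactors = FALSE)
    dat <- data.frame(key = c("b", "c", "b", "a"),
                      stringsAsFactors = FALSE)

    ## match() gives, for each dat$key, the row number in lookup with
    ## the same key (NA if absent); subscripting lookup$value by those
    ## row numbers pulls the values across in one vectorized step
    dat$value <- lookup$value[match(dat$key, lookup$key)]
    dat
    ##   key value
    ## 1   b     2
    ## 2   c     3
    ## 3   b     2
    ## 4   a     1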
E.g., applied to your two-column Start/End key:

    f0 <- function (dataP, ttoevP) {
        ## encode each Start/End pair as a single string so that one
        ## match() call can look up the pairs in a single pass
        encode <- function(df) paste(df$Start, df$End, sep = "\r")
        if (anyDuplicated(ttoevPEncoded <- encode(ttoevP))) {
            stop("duplicated Start/End pairs in ttoevP")
        }
        i <- match(encode(dataP), ttoevPEncoded)
        dataP$OEV.T <- ttoevP$OEV.T[i]
        dataP
    }

I made sample inputs with

    makeData <- function (nrow, nTimes) {
        Start <- trunc(runif(nrow, 1, nTimes))
        End <- trunc(runif(nrow, Start, nTimes))
        dataP <- data.frame(Start = Start, End = End)
        ## one lookup row for every pair with Start <= End
        ttoevP <- expand.grid(Start = seq_len(nTimes), End = seq_len(nTimes))
        ttoevP <- ttoevP[ttoevP$Start <= ttoevP$End, ]
        ttoevP$OEV.T <- paste(ttoevP$Start, ttoevP$End, sep = "-")
        list(dataP = dataP, ttoevP = ttoevP)
    }

For nrow = 4*10^4 and nTimes = 10^3, f0 took 1.5 seconds and merge took 10.5 (a sketch for reproducing this comparison follows at the end of this message). Aside from the order of the rows, they produced the same output.

You can make your looping solution faster by removing repeated operations from the loop, especially operations on a data.frame (vector operations are much faster, and no operation at all is faster still). E.g., replace

    for(i in seq_len(nrow(df))) {
        df$column[i] <- func(i)
    }

with

    column <- df$column
    for(i in seq_len(nrow(df))) {
        column[i] <- func(i)
    }
    df$column <- column

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com

> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Renger van Nieuwkoop
> Sent: Monday, September 09, 2013 5:45 AM
> To: r-help@r-project.org
> Subject: [R] Merging big data sets
>
> Hi
> I have 6 rather big data sets (between 400000 and 800000 lines) of transport data (times, distances, and travelers between nodes). They all have a common index (start-end nodes).
> I want to aggregate these data, but for that I have to merge them.
> I tried to use "merge", with the result that R (3.0.1) crashes (Windows 8 machine, 16 GB RAM).
> Then I tried the join from the data.table package. There I got the message that 2^34 is too big (no idea why it is 2^34, as it is a left join).
> Then I decided to do a loop over the tables, assigning the values row by row, which takes a very, very long time (still running at the moment).
>
> Here is the code:
>
>     for (i in 1:length(dataP$Start)) {
>         c <- dataP$Start[i]
>         d <- dataP$End[i]
>         dataP[J(c, d)]$OEV.T <- ttoevP[J(c, d)]$OEV.T
>     }
>
> dataP has 800'000 lines and ttoevP has about 500'000 lines.
>
> Any hints to speed up this process are welcome.
>
> Renger
> _________________________________________
> Centre of Economic Research (CER-ETH)
> Zürichbergstrasse 18 (ZUE)
> CH - 8032 Zürich
> +41 44 632 02 63
> mailto: reng...@etzh.ch
> blog.modelworks.ch
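A minimal sketch for reproducing the timing comparison above, assuming the f0 and makeData definitions from this message (the merge() call shown is one plausible equivalent, not necessarily the exact call I timed; absolute times depend on the machine):

    set.seed(1)   # reproducible random inputs
    inputs <- makeData(nrow = 4 * 10^4, nTimes = 10^3)

    ## match-based lookup
    system.time(res0 <- f0(inputs$dataP, inputs$ttoevP))

    ## merge-based equivalent; all.x = TRUE keeps every row of dataP,
    ## as f0 does
    system.time(res1 <- merge(inputs$dataP, inputs$ttoevP,
                              by = c("Start", "End"), all.x = TRUE))

    ## the results agree once both are put in the same row order
    ## (OEV.T is determined by Start/End, so ties among duplicate
    ## pairs do not matter)
    a <- res0[order(res0$Start, res0$End), ]
    b <- res1[order(res1$Start, res1$End), ]
    rownames(a) <- rownames(b) <- NULL
    all.equal(a, b)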