Your sample code does not run and you didn't show the format of
your inputs.  If your 'ttoevP' is a table without duplicate Start/End
pairs that maps each Start/End pair to OEV.T, then using match() and
subscripting can be quicker than merge().  E.g.,

   f0 <- function (dataP, ttoevP) 
   {
       # encode each Start/End pair as a single string key;
       # "\r" is a separator unlikely to appear in the data itself
       encode <- function(df) paste(df$Start, df$End, sep = "\r")
       if (anyDuplicated(ttoevPEncoded <- encode(ttoevP))) {
           stop("duplicated Start/End pairs in ttoevP")
       }
       # row of ttoevP corresponding to each row of dataP (NA if none)
       i <- match(encode(dataP), ttoevPEncoded)
       dataP$OEV.T <- ttoevP$OEV.T[i]
       dataP
   }

I made sample inputs with

makeData <- function (nrow, nTimes) 
{
    # nrow random Start/End pairs with 1 <= Start <= End < nTimes
    Start <- trunc(runif(nrow, 1, nTimes))
    End <- trunc(runif(nrow, Start, nTimes))
    dataP <- data.frame(Start = Start, End = End)
    # lookup table with one row for each Start/End pair, Start <= End
    ttoevP <- expand.grid(Start = seq_len(nTimes), End = seq_len(nTimes))
    ttoevP <- ttoevP[ttoevP$Start <= ttoevP$End, ]
    ttoevP$OEV.T <- paste(ttoevP$Start, ttoevP$End, sep = "-")
    list(dataP = dataP, ttoevP = ttoevP)
}

For nrow=4*10^4 and nTimes=10^3, f0 took 1.5 seconds and merge took
10.5.  Aside from the order of the rows, they produced the same output.
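
I timed and compared them with something like the following (calling
merge() with by = c("Start", "End") is my assumption about how merge
was invoked):

    d <- makeData(nrow = 4 * 10^4, nTimes = 10^3)
    system.time(z0 <- f0(d$dataP, d$ttoevP))
    system.time(z1 <- merge(d$dataP, d$ttoevP, by = c("Start", "End")))
    # same OEV.T values once both results are sorted the same way
    ord <- function(df) df[order(df$Start, df$End), ]
    identical(ord(z0)$OEV.T, ord(z1)$OEV.T)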

You can make your looping solution faster by hoisting repeated
operations out of the loop, especially operations on a data.frame
(vector operations are much faster, and doing no operation at all
is faster still).  E.g., replace
    for(i in seq_len(nrow(df))) {
        df$column[i] <- func(i)
    }
with
    column <- df$column
    for(i in seq_len(nrow(df))) {
        column[i] <- func(i)
    }
    df$column <- column
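
If func itself is vectorized, or can be made so, the loop can be
dropped entirely (this assumes func accepts a vector of indices and
returns one value per index):

    df$column <- func(seq_len(nrow(df)))

and if func must be called once per row, vapply() at least keeps the
assignments out of the data.frame until the end (assuming here that
func(i) returns a single number):

    df$column <- vapply(seq_len(nrow(df)), func, numeric(1))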

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Renger van Nieuwkoop
> Sent: Monday, September 09, 2013 5:45 AM
> To: r-help@r-project.org
> Subject: [R] Merging big data sets
> 
> Hi
> I have 6 rather big data sets (between 400000 and 800000 lines) on
> transport data (times, distances and travelers between nodes).  They
> all have a common index (start-end nodes).
> I want to aggregate this data, but for that I have to merge them.
> I tried to use "merge" with the result that R (3.0.1) crashes
> (Windows 8 machine, 16 Gb Ram).
> Then I tried the join from the data.table package.  Here I got the
> message that 2^34 is too big (no idea why it is 2^34 as it is a left
> join).
> Then I decided to do a loop using the tables and assigning them,
> which takes a very, very long time (still running at the moment).
> 
> Here is the code:
> for (i in 1:length(dataP$Start)){
>     c<-dataP$Start[i]
>     d<-dataP$End[i]
>     dataP[J(c,d)]$OEV.T<-ttoevP[J(c,d)]$OEV.T
> }
> 
> dataP has 800'000 lines and ttoevP has about 500'000 lines.
> 
> Any hints to speed up this process are welcome.
> 
> Renger
> _________________________________________
> Centre of Economic Research (CER-ETH)
> Zürichbergstrasse 18 (ZUE)
> CH - 8032 Zürich
> +41 44 632 02 63
> mailto:reng...@etzh.ch
> blog.modelworks.ch
> 
