Sarah, you make it sound as though everyone should be using matrices, even
though they have distinct disadvantages for many types of analysis.
You are right that rbind on data frames is slow, but dplyr::bind_rows
handles data frames almost as fast as your matrix rbind solution.
And if you use what you know about your data frames and skip the error
checking that bind_rows does, you can beat both of them without converting
to matrices, as the "tm.dfcolcat" solution below illustrates. (Not for
everyday use, but if you have a big job and the data are clean, this may
make a difference.)
Data frames, handled properly, are only slightly slower than matrices for
most purposes. I have seen numerical solutions of partial differential
equations run lightning fast using pre-allocated data frames and vector
calculations, so even traditional "matrix" calculation domains don't have
to use matrices to be competitive.
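For instance, here is a minimal sketch of that pattern (a hypothetical
explicit heat-equation step, not the actual PDE code referred to above):
the solution lives in a pre-allocated data frame, one column per saved
timestep, and every update is a whole-vector calculation.

nx <- 100 # spatial grid points
nt <- 50  # timesteps to keep
r <- 0.1  # alpha * dt / dx^2, chosen small enough for stability
soln <- data.frame( matrix( 0, nrow = nx, ncol = nt ) )
soln[[ 1 ]] <- sin( seq( 0, pi, length.out = nx ) ) # initial condition
inner <- 2:( nx - 1 ) # boundaries stay fixed at zero
for ( j in seq_len( nt - 1 ) ) { # only the timestep loop remains
  u <- soln[[ j ]]
  # vectorized finite-difference update of all interior points at once
  soln[[ j + 1 ]][ inner ] <- u[ inner ] +
    r * ( u[ inner + 1 ] - 2 * u[ inner ] + u[ inner - 1 ] )
}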
######################
testsize <- 5000
N <- 20
set.seed(1234)
testdf.list <- lapply( seq_len( testsize )
                     , function( x ) {
                         data.frame( matrix( runif( 300 ), nrow=100 ) )
                       }
                     )
tm.rbind <- function( x = 0 ) {
  system.time( r.df <- do.call( "rbind", testdf.list ) )
}
#toss the first one (warm-up; its timing is discarded)
tm.rbind()
tms.rbind <- data.frame( do.call( rbind
                                , lapply( 1:N, tm.rbind )
                                )
                       , which = "rbind"
                       )
tm.rbindm <- function( x = 0 ) {
  system.time({
    testm.list <- lapply( testdf.list, as.matrix )
    r.m <- do.call( rbind, testm.list )
  })
}
#toss the first one
tm.rbindm()
tms.rbindm <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.rbindm )
                                 )
                        , which = "rbindm"
                        )
tm.dfcopy <- function( x = 0 ) {
  system.time({
    l.df <- data.frame( matrix( NA
                              , nrow = 100 * testsize
                              , ncol = 3
                              )
                      )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.df[ start:end, ] <- testdf.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfcopy()
tms.dfcopy <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.dfcopy )
                                 )
                        , which = "dfcopy"
                        )
tm.dfmatcopy <- function( x = 0 ) {
  system.time({
    l.m <- data.frame( matrix( NA
                             , nrow = 100 * testsize
                             , ncol = 3
                             )
                     )
    testm.list <- lapply( testdf.list, as.matrix )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.m[ start:end, ] <- testm.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfmatcopy()
tms.dfmatcopy <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.dfmatcopy )
                                    )
                           , which = "dfmatcopy"
                           )
tm.bind_rows <- function( x = 0 ) {
  system.time({
    dplyr::bind_rows( testdf.list ) # requires the dplyr package
  })
}
#toss the first one
tm.bind_rows()
tms.bind_rows <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.bind_rows )
                                    )
                           , which = "bind_rows"
                           )
tm.dfcolcat <- function( x = 0 ) {
  system.time({
    mycolnames <- names( testdf.list[[ 1 ]] )
    # concatenate each column across all the data frames, then re-wrap as a
    # data frame: skips the per-element checks that rbind and bind_rows do
    result <- setNames( data.frame( lapply( mycolnames
                                          , function( colidx ) {
                                              do.call( c
                                                     , lapply( testdf.list
                                                             , function( v ) v[[ colidx ]]
                                                             )
                                                     )
                                            }
                                          )
                                  )
                      , mycolnames
                      )
  })
}
#toss the first one
tm.dfcolcat()
tms.dfcolcat <- data.frame( do.call( rbind
                                   , lapply( 1:N, tm.dfcolcat )
                                   )
                          , which = "dfcolcat"
                          )
# Sarah's timings, for reference (different column names, so not merged below)
tms.sarah <- read.table( text =
"   user system elapsed        which
  34.280  0.009  34.317     tm.rbind
   0.310  0.000   0.311    tm.rbindm
  81.890  0.069  82.162    tm.dfcopy
  67.664  0.047  68.009 tm.dfmatcopy
", header = TRUE, as.is = TRUE )
mergetms <- rbind( tms.rbind
                 , tms.rbindm
                 , tms.dfcopy
                 , tms.dfmatcopy
                 , tms.bind_rows
                 , tms.dfcolcat
                 )
mergetms$which <- factor( mergetms$which
                        , levels = c( "rbind", "rbindm", "dfcopy"
                                    , "dfmatcopy", "bind_rows", "dfcolcat"
                                    )
                        )
plot( user.self ~ which, data = mergetms )
plot( user.self ~ which, data = mergetms, ylim = c( 0, 4 ) ) # zoom in on the fast methods
summary( tms.rbind )
# user.self sys.self elapsed user.child sys.child
# Min. :18.84 Min. :0.0000 Min. :18.92 Min. : NA Min. : NA
# 1st Qu.:20.83 1st Qu.:0.0275 1st Qu.:20.96 1st Qu.: NA 1st Qu.: NA
# Median :22.91 Median :0.0400 Median :23.00 Median : NA Median : NA
# Mean :25.06 Mean :0.0430 Mean :25.21 Mean :NaN Mean :NaN
# 3rd Qu.:24.29 3rd Qu.:0.0600 3rd Qu.:24.39 3rd Qu.: NA 3rd Qu.: NA
# Max. :39.36 Max. :0.1000 Max. :39.94 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.rbindm )
# user.self sys.self elapsed user.child sys.child
# Min. :0.2200 Min. :0 Min. :0.2200 Min. : NA Min. : NA
# 1st Qu.:0.5600 1st Qu.:0 1st Qu.:0.5800 1st Qu.: NA 1st Qu.: NA
# Median :0.5850 Median :0 Median :0.5900 Median : NA Median : NA
# Mean :0.5465 Mean :0 Mean :0.5555 Mean :NaN Mean :NaN
# 3rd Qu.:0.5900 3rd Qu.:0 3rd Qu.:0.5925 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.6100 Max. :0 Max. :0.6100 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcopy )
# user.self sys.self elapsed user.child sys.child
# Min. :114.2 Min. :0.0000 Min. :114.3 Min. : NA Min. : NA
# 1st Qu.:122.7 1st Qu.:0.0000 1st Qu.:123.0 1st Qu.: NA 1st Qu.: NA
# Median :128.3 Median :0.0050 Median :128.4 Median : NA Median : NA
# Mean :134.5 Mean :0.0185 Mean :134.8 Mean :NaN Mean :NaN
# 3rd Qu.:134.7 3rd Qu.:0.0325 3rd Qu.:134.8 3rd Qu.: NA 3rd Qu.: NA
# Max. :261.5 Max. :0.0800 Max. :263.4 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfmatcopy )
# user.self sys.self elapsed user.child sys.child
# Min. : 98.15 Min. : 0.050 Min. :102.0 Min. : NA Min. : NA
# 1st Qu.:136.47 1st Qu.: 3.495 1st Qu.:144.6 1st Qu.: NA 1st Qu.: NA
# Median :147.53 Median : 7.135 Median :158.3 Median : NA Median : NA
# Mean :177.10 Mean : 7.030 Mean :185.2 Mean :NaN Mean :NaN
# 3rd Qu.:159.12 3rd Qu.:10.932 3rd Qu.:166.9 3rd Qu.: NA 3rd Qu.: NA
# Max. :362.95 Max. :16.100 Max. :364.3 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.bind_rows )
# user.self sys.self elapsed user.child sys.child
# Min. :0.8200 Min. :0 Min. :0.8200 Min. : NA Min. : NA
# 1st Qu.:0.8300 1st Qu.:0 1st Qu.:0.8375 1st Qu.: NA 1st Qu.: NA
# Median :0.8400 Median :0 Median :0.8400 Median : NA Median : NA
# Mean :0.8460 Mean :0 Mean :0.8480 Mean :NaN Mean :NaN
# 3rd Qu.:0.8525 3rd Qu.:0 3rd Qu.:0.8525 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.9400 Max. :0 Max. :0.9900 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcolcat )
# user.self sys.self elapsed user.child sys.child
# Min. :0.340 Min. :0 Min. :0.340 Min. : NA Min. : NA
# 1st Qu.:0.350 1st Qu.:0 1st Qu.:0.350 1st Qu.: NA 1st Qu.: NA
# Median :0.360 Median :0 Median :0.360 Median : NA Median : NA
# Mean :0.358 Mean :0 Mean :0.357 Mean :NaN Mean :NaN
# 3rd Qu.:0.360 3rd Qu.:0 3rd Qu.:0.360 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.380 Max. :0 Max. :0.380 Max. : NA Max. : NA
# NA's :20 NA's :20
######################
On Mon, 27 Jun 2016, Sarah Goslee wrote:
That's not what I said, though, and it's not necessarily true. Growing
an object within a loop _is_ a slow process, but that's not the
problem here. The problem is using data frames instead of matrices.
The need to manage column classes is very costly. Converting to
matrices will almost always be enormously faster.
Here's an expansion of the previous example I posted, in four parts:
1. do.call with data frame - very slow - 34.317 s elapsed time for
5000 data frames
2. do.call with matrix - very fast - 0.311 s elapsed
3. pre-allocated loop with data frame - even slower (!) - 82.162 s
4. pre-allocated loop with matrix - faster, but still slow - 68.009 s
It matters whether the columns are converted to numeric or character,
and the time doesn't scale linearly with list length. For a particular
problem, the best solution may vary greatly (and I didn't even include
packages beyond the base functionality). In general, though, using
matrices is faster than using data frames, and using do.call is faster
than using a pre-allocated loop, which is much faster than growing an
object.
Sarah
testsize <- 5000
set.seed(1234)
testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
testdf.list <- lapply(seq_len(testsize), function(x)testdf)
system.time(r.df <- do.call("rbind", testdf.list))
#    user  system elapsed
#  34.280   0.009  34.317

system.time({
  testm.list <- lapply(testdf.list, as.matrix)
  r.m <- do.call("rbind", testm.list)
})
#    user  system elapsed
#   0.310   0.000   0.311

system.time({
  l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
  for(i in seq_len(testsize)) {
    start <- (i-1)*100 + 1
    end <- i*100
    l.df[start:end, ] <- testdf.list[[i]]
  }
})
#    user  system elapsed
#  81.890   0.069  82.162

system.time({
  l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
  testm.list <- lapply(testdf.list, as.matrix)
  for(i in seq_len(testsize)) {
    start <- (i-1)*100 + 1
    end <- i*100
    l.m[start:end, ] <- testm.list[[i]]
  }
})
#    user  system elapsed
#  67.664   0.047  68.009
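For contrast with Sarah's four timed approaches, the "growing an object"
anti-pattern she ranks slowest of all would look like the sketch below
(for illustration only, not part of her benchmark). Each pass copies the
entire accumulated result, so the cost grows roughly quadratically with
list length:

r.grow <- testdf.list[[ 1 ]]
for ( i in 2:length( testdf.list ) ) {
  r.grow <- rbind( r.grow, testdf.list[[ i ]] ) # full copy on every pass
}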
On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwa...@me.com> wrote:
Hi,
Just to add my tuppence, which might not even be worth that these days...
I found the following blog post from 2013, which is likely dated to some
extent, but provided some benchmarks for a few methods:
http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html
There is also a comment with a reference there to using the data.table package,
which I don't use, but may be something to evaluate.
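For anyone who wants to evaluate it, a minimal sketch of that data.table
route (untested here, as Marc notes he doesn't use the package):
data.table::rbindlist() does the row-binding in C and returns a data.table,
which is also a data frame.

result <- data.table::rbindlist( data.list ) # data.list as in Witold's post
data.table::setDF( result ) # back to a plain data frame, by reference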
As Bert and Sarah hinted at, there is overhead in taking the repetitive
piecemeal approach.
If all of your data frames have exactly the same column structure (column
order, column types), it may be prudent to pre-allocate a data frame of the
target total row count and then "insert" each "sub" data frame by row
indexing into the target structure, as sketched below.
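A minimal sketch of that pre-allocation approach, assuming every data frame
in data.list has the same 100x3 numeric structure (note that the tm.dfcopy
benchmark at the top of this thread suggests it can still be slow when the
target is a data frame):

n.each <- 100 # rows per sub data frame, per Witold's description
target <- data.frame( matrix( NA_real_
                            , nrow = n.each * length( data.list )
                            , ncol = 3
                            )
                    )
for ( i in seq_along( data.list ) ) {
  rows <- ( ( i - 1 ) * n.each + 1 ):( i * n.each )
  target[ rows, ] <- data.list[[ i ]] # "insert" by row indexing
}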
Regards,
Marc Schwartz
On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewol...@gmail.com> wrote:
Hi Bert,
You are most likely right. I just thought that do.call("rbind", ...) was
somehow more clever and would allocate the memory up front. My error. After
more searching I did find rbind.fill from plyr, which seems to do the job
(it computes the size of the result data.frame and allocates it first);
usage is sketched below.
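A one-call sketch on the list from the original post (assuming plyr is
installed):

data <- plyr::rbind.fill( data.list ) # pre-allocates the result, then fills it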
best
On 27 June 2016 at 18:49, Bert Gunter <bgunter.4...@gmail.com> wrote:
The following might be nonsense, as I have no understanding of R
internals; but ....
"Growing" structures in R by iteratively adding new pieces is often
warned against as inefficient when the number of iterations is large, and
your rbind() invocation might fall under this rubric. If so, you might
try issuing the call, say, 20 times over 10k disjoint subsets of the
list, and then rbinding up the 20 large frames (see the sketch below).
Again, caveat emptor.
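A rough sketch of that chunked approach (the chunk count of 20 is Bert's
illustrative number, not a tuned value):

chunk.size <- ceiling( length( data.list ) / 20 )
chunk.id <- ceiling( seq_along( data.list ) / chunk.size )
big.pieces <- lapply( split( data.list, chunk.id )
                    , function( chunk ) do.call( rbind, chunk )
                    )
result <- do.call( rbind, big.pieces ) # then rbind the 20 large frames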
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewol...@gmail.com> wrote:
I have a list (variable name data.list) of approx 200k data.frames,
each with dim(data.frame) approx 100x3.
The call

data <- do.call( "rbind", data.list )

does not complete - the run time is prohibitive (I killed the R session
after 5 minutes).
I would think that merging data.frames is a common operation. Is
there a better-performing function that I could use?
Thank you.
Witold
---------------------------------------------------------------------------
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
DCN:<jdnew...@dcn.davis.ca.us>