Sarah, you make it sound as though everyone should be using matrices, even
though they have distinct disadvantages for many types of analysis.
You are right that rbind on data frames is slow, but dplyr::bind_rows
handles data frames almost as fast as your matrix rbind solution.
And if you use what you know about your data frames and skip the error
checking that bind_rows does, you can beat both of them without converting
to matrices, as the "tm.dfcolcat" solution below illustrates. (Not for
everyday use, but if you have a big job and the data are clean, this may
make a difference.)
Data frames, handled properly, are only slightly slower than matrices for
most purposes. I have seen numerical solutions of partial differential
equations run lightning fast using pre-allocated data frames and vector
calculations, so even traditional "matrix" calculation domains don't have
to use matrices to be competitive.
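For instance, here is a minimal sketch of that pattern (a hypothetical
explicit heat-equation step, not the actual PDE code referred to above):
the solution lives in a pre-allocated data frame, one column per saved
timestep, and every update is a whole-vector calculation.

nx <- 100 # spatial grid points
nt <- 50  # timesteps to keep
r <- 0.1  # alpha * dt / dx^2, chosen small enough for stability
soln <- data.frame( matrix( 0, nrow = nx, ncol = nt ) )
soln[[ 1 ]] <- sin( seq( 0, pi, length.out = nx ) ) # initial condition
inner <- 2:( nx - 1 ) # boundaries stay fixed at zero
for ( j in seq_len( nt - 1 ) ) { # only the timestep loop remains
  u <- soln[[ j ]]
  # vectorized finite-difference update of all interior points at once
  soln[[ j + 1 ]][ inner ] <- u[ inner ] +
    r * ( u[ inner + 1 ] - 2 * u[ inner ] + u[ inner - 1 ] )
}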
######################
testsize <- 5000
N <- 20
set.seed(1234)
testdf.list <- lapply( seq_len( testsize )
                     , function( x ) {
                         data.frame( matrix( runif( 300 ), nrow=100 ) )
                       }
                     )
tm.rbind <- function( x = 0 ) {
  system.time( r.df <- do.call( "rbind", testdf.list ) )
}
#toss the first one (warm-up; its timing is discarded)
tm.rbind()
tms.rbind <- data.frame( do.call( rbind
                                , lapply( 1:N, tm.rbind )
                                )
                       , which = "rbind"
                       )
tm.rbindm <- function( x = 0 ) {
  system.time({
    testm.list <- lapply( testdf.list, as.matrix )
    r.m <- do.call( rbind, testm.list )
  })
}
#toss the first one
tm.rbindm()
tms.rbindm <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.rbindm )
                                 )
                        , which = "rbindm"
                        )
tm.dfcopy <- function( x = 0 ) {
  system.time({
    l.df <- data.frame( matrix( NA
                              , nrow = 100 * testsize
                              , ncol = 3
                              )
                      )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.df[ start:end, ] <- testdf.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfcopy()
tms.dfcopy <- data.frame( do.call( rbind
                                 , lapply( 1:N, tm.dfcopy )
                                 )
                        , which = "dfcopy"
                        )
tm.dfmatcopy <- function( x = 0 ) {
  system.time({
    l.m <- data.frame( matrix( NA
                             , nrow = 100 * testsize
                             , ncol = 3
                             )
                     )
    testm.list <- lapply( testdf.list, as.matrix )
    for ( i in seq_len( testsize ) ) {
      start <- ( i - 1 ) * 100 + 1
      end <- i * 100
      l.m[ start:end, ] <- testm.list[[ i ]]
    }
  })
}
#toss the first one
tm.dfmatcopy()
tms.dfmatcopy <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.dfmatcopy )
                                    )
                           , which = "dfmatcopy"
                           )
tm.bind_rows <- function( x = 0 ) {
  system.time({
    dplyr::bind_rows( testdf.list ) # requires the dplyr package
  })
}
#toss the first one
tm.bind_rows()
tms.bind_rows <- data.frame( do.call( rbind
                                    , lapply( 1:N, tm.bind_rows )
                                    )
                           , which = "bind_rows"
                           )
tm.dfcolcat <- function( x = 0 ) {
  system.time({
    mycolnames <- names( testdf.list[[ 1 ]] )
    # concatenate each column across all the data frames, then re-wrap as a
    # data frame: skips the per-element checks that rbind and bind_rows do
    result <- setNames( data.frame( lapply( mycolnames
                                          , function( colidx ) {
                                              do.call( c
                                                     , lapply( testdf.list
                                                             , function( v ) v[[ colidx ]]
                                                             )
                                                     )
                                            }
                                          )
                                  )
                      , mycolnames
                      )
  })
}
#toss the first one
tm.dfcolcat()
tms.dfcolcat <- data.frame( do.call( rbind
                                   , lapply( 1:N, tm.dfcolcat )
                                   )
                          , which = "dfcolcat"
                          )
# Sarah's timings, for reference (different column names, so not merged below)
tms.sarah <- read.table( text =
"   user system elapsed        which
  34.280  0.009  34.317     tm.rbind
   0.310  0.000   0.311    tm.rbindm
  81.890  0.069  82.162    tm.dfcopy
  67.664  0.047  68.009 tm.dfmatcopy
", header = TRUE, as.is = TRUE )
mergetms <- rbind( tms.rbind
                 , tms.rbindm
                 , tms.dfcopy
                 , tms.dfmatcopy
                 , tms.bind_rows
                 , tms.dfcolcat
                 )
mergetms$which <- factor( mergetms$which
                        , levels = c( "rbind", "rbindm", "dfcopy"
                                    , "dfmatcopy", "bind_rows", "dfcolcat"
                                    )
                        )
plot( user.self ~ which, data = mergetms )
plot( user.self ~ which, data = mergetms, ylim = c( 0, 4 ) ) # zoom in on the fast methods
summary( tms.rbind )
# user.self sys.self elapsed user.child sys.child
# Min. :18.84 Min. :0.0000 Min. :18.92 Min. : NA Min. : NA
# 1st Qu.:20.83 1st Qu.:0.0275 1st Qu.:20.96 1st Qu.: NA 1st Qu.: NA
# Median :22.91 Median :0.0400 Median :23.00 Median : NA Median : NA
# Mean :25.06 Mean :0.0430 Mean :25.21 Mean :NaN Mean :NaN
# 3rd Qu.:24.29 3rd Qu.:0.0600 3rd Qu.:24.39 3rd Qu.: NA 3rd Qu.: NA
# Max. :39.36 Max. :0.1000 Max. :39.94 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.rbindm )
# user.self sys.self elapsed user.child sys.child
# Min. :0.2200 Min. :0 Min. :0.2200 Min. : NA Min. : NA
# 1st Qu.:0.5600 1st Qu.:0 1st Qu.:0.5800 1st Qu.: NA 1st Qu.: NA
# Median :0.5850 Median :0 Median :0.5900 Median : NA Median : NA
# Mean :0.5465 Mean :0 Mean :0.5555 Mean :NaN Mean :NaN
# 3rd Qu.:0.5900 3rd Qu.:0 3rd Qu.:0.5925 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.6100 Max. :0 Max. :0.6100 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcopy )
# user.self sys.self elapsed user.child sys.child
# Min. :114.2 Min. :0.0000 Min. :114.3 Min. : NA Min. : NA
# 1st Qu.:122.7 1st Qu.:0.0000 1st Qu.:123.0 1st Qu.: NA 1st Qu.: NA
# Median :128.3 Median :0.0050 Median :128.4 Median : NA Median : NA
# Mean :134.5 Mean :0.0185 Mean :134.8 Mean :NaN Mean :NaN
# 3rd Qu.:134.7 3rd Qu.:0.0325 3rd Qu.:134.8 3rd Qu.: NA 3rd Qu.: NA
# Max. :261.5 Max. :0.0800 Max. :263.4 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfmatcopy )
# user.self sys.self elapsed user.child sys.child
# Min. : 98.15 Min. : 0.050 Min. :102.0 Min. : NA Min. : NA
# 1st Qu.:136.47 1st Qu.: 3.495 1st Qu.:144.6 1st Qu.: NA 1st Qu.: NA
# Median :147.53 Median : 7.135 Median :158.3 Median : NA Median : NA
# Mean :177.10 Mean : 7.030 Mean :185.2 Mean :NaN Mean :NaN
# 3rd Qu.:159.12 3rd Qu.:10.932 3rd Qu.:166.9 3rd Qu.: NA 3rd Qu.: NA
# Max. :362.95 Max. :16.100 Max. :364.3 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.bind_rows )
# user.self sys.self elapsed user.child sys.child
# Min. :0.8200 Min. :0 Min. :0.8200 Min. : NA Min. : NA
# 1st Qu.:0.8300 1st Qu.:0 1st Qu.:0.8375 1st Qu.: NA 1st Qu.: NA
# Median :0.8400 Median :0 Median :0.8400 Median : NA Median : NA
# Mean :0.8460 Mean :0 Mean :0.8480 Mean :NaN Mean :NaN
# 3rd Qu.:0.8525 3rd Qu.:0 3rd Qu.:0.8525 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.9400 Max. :0 Max. :0.9900 Max. : NA Max. : NA
# NA's :20 NA's :20
summary( tms.dfcolcat )
# user.self sys.self elapsed user.child sys.child
# Min. :0.340 Min. :0 Min. :0.340 Min. : NA Min. : NA
# 1st Qu.:0.350 1st Qu.:0 1st Qu.:0.350 1st Qu.: NA 1st Qu.: NA
# Median :0.360 Median :0 Median :0.360 Median : NA Median : NA
# Mean :0.358 Mean :0 Mean :0.357 Mean :NaN Mean :NaN
# 3rd Qu.:0.360 3rd Qu.:0 3rd Qu.:0.360 3rd Qu.: NA 3rd Qu.: NA
# Max. :0.380 Max. :0 Max. :0.380 Max. : NA Max. : NA
# NA's :20 NA's :20
######################
On Mon, 27 Jun 2016, Sarah Goslee wrote:
That's not what I said, though, and it's not necessarily true. Growing
an object within a loop _is_ a slow process, but that's not the
problem here. The problem is using data frames instead of matrices.
The need to manage column classes is very costly. Converting to
matrices will almost always be enormously faster.
Here's an expansion of the previous example I posted, in four parts:
1. do.call with data frame - very slow - 34.317 s elapsed time for
5000 data frames
2. do.call with matrix - very fast - 0.311 s elapsed
3. pre-allocated loop with data frame - even slower (!) - 82.162 s
4. pre-allocated loop with matrix - faster, but still slow - 68.009 s
It matters whether the columns are converted to numeric or character,
and the time doesn't scale linearly with list length. For a particular
problem, the best solution may vary greatly (and I didn't even include
packages beyond the base functionality). In general, though, using
matrices is faster than using data frames, and using do.call is faster
than using a pre-allocated loop, which is much faster than growing an
object.
Sarah
testsize <- 5000
set.seed(1234)
testdf <- data.frame(matrix(runif(300), nrow=100, ncol=3))
testdf.list <- lapply(seq_len(testsize), function(x)testdf)
system.time(r.df <- do.call("rbind", testdf.list))
#    user  system elapsed
#  34.280   0.009  34.317

system.time({
  testm.list <- lapply(testdf.list, as.matrix)
  r.m <- do.call("rbind", testm.list)
})
#    user  system elapsed
#   0.310   0.000   0.311

system.time({
  l.df <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
  for(i in seq_len(testsize)) {
    start <- (i-1)*100 + 1
    end <- i*100
    l.df[start:end, ] <- testdf.list[[i]]
  }
})
#    user  system elapsed
#  81.890   0.069  82.162

system.time({
  l.m <- data.frame(matrix(NA, nrow=100 * testsize, ncol=3))
  testm.list <- lapply(testdf.list, as.matrix)
  for(i in seq_len(testsize)) {
    start <- (i-1)*100 + 1
    end <- i*100
    l.m[start:end, ] <- testm.list[[i]]
  }
})
#    user  system elapsed
#  67.664   0.047  68.009
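For contrast with Sarah's four timed approaches, the "growing an object"
anti-pattern she ranks slowest of all would look like the sketch below
(for illustration only, not part of her benchmark). Each pass copies the
entire accumulated result, so the cost grows roughly quadratically with
list length:

r.grow <- testdf.list[[ 1 ]]
for ( i in 2:length( testdf.list ) ) {
  r.grow <- rbind( r.grow, testdf.list[[ i ]] ) # full copy on every pass
}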
On Mon, Jun 27, 2016 at 1:05 PM, Marc Schwartz <marc_schwa...@me.com> wrote:
Hi,
Just to add my tuppence, which might not even be worth that these days...
I found the following blog post from 2013, which is likely dated to some
extent, but provided some benchmarks for a few methods:
http://rcrastinate.blogspot.com/2013/05/the-rbinding-race-for-vs-docall-vs.html
There is also a comment with a reference there to using the data.table package,
which I don't use, but may be something to evaluate.
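For anyone who wants to evaluate it, a minimal sketch of that data.table
route (untested here, as Marc notes he doesn't use the package):
data.table::rbindlist() does the row-binding in C and returns a data.table,
which is also a data frame.

result <- data.table::rbindlist( data.list ) # data.list as in Witold's post
data.table::setDF( result ) # back to a plain data frame, by reference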
As Bert and Sarah hinted at, there is overhead in taking the repetitive
piecemeal approach.
If all of your data frames have exactly the same column structure (column
order, column types), it may be prudent to pre-allocate a data frame of the
target total row count and then "insert" each "sub" data frame by row
indexing into the target structure, as sketched below.
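A minimal sketch of that pre-allocation approach, assuming every data frame
in data.list has the same 100x3 numeric structure (note that the tm.dfcopy
benchmark at the top of this thread suggests it can still be slow when the
target is a data frame):

n.each <- 100 # rows per sub data frame, per Witold's description
target <- data.frame( matrix( NA_real_
                            , nrow = n.each * length( data.list )
                            , ncol = 3
                            )
                    )
for ( i in seq_along( data.list ) ) {
  rows <- ( ( i - 1 ) * n.each + 1 ):( i * n.each )
  target[ rows, ] <- data.list[[ i ]] # "insert" by row indexing
}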
Regards,
Marc Schwartz
On Jun 27, 2016, at 11:54 AM, Witold E Wolski <wewol...@gmail.com> wrote:
Hi Bert,
You are most likely right. I just thought that do.call("rbind", ...) was
somehow more clever and would allocate the memory up front. My error. After
more searching I did find rbind.fill from plyr, which seems to do the job
(it computes the size of the result data.frame and allocates it first);
usage is sketched below.
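A one-call sketch on the list from the original post (assuming plyr is
installed):

data <- plyr::rbind.fill( data.list ) # pre-allocates the result, then fills it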
best
On 27 June 2016 at 18:49, Bert Gunter <bgunter.4...@gmail.com> wrote:
The following might be nonsense, as I have no understanding of R
internals; but ....
"Growing" structures in R by iteratively adding new pieces is often
warned against as inefficient when the number of iterations is large, and
your rbind() invocation might fall under this rubric. If so, you might
try issuing the call, say, 20 times over 10k disjoint subsets of the
list, and then rbinding up the 20 large frames (see the sketch below).
Again, caveat emptor.
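A rough sketch of that chunked approach (the chunk count of 20 is Bert's
illustrative number, not a tuned value):

chunk.size <- ceiling( length( data.list ) / 20 )
chunk.id <- ceiling( seq_along( data.list ) / chunk.size )
big.pieces <- lapply( split( data.list, chunk.id )
                    , function( chunk ) do.call( rbind, chunk )
                    )
result <- do.call( rbind, big.pieces ) # then rbind the 20 large frames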
Cheers,
Bert
Bert Gunter
"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
On Mon, Jun 27, 2016 at 8:51 AM, Witold E Wolski <wewol...@gmail.com> wrote:
I have a list (variable name data.list) of approx 200k data.frames,
each with dim(data.frame) approx 100x3.
The call

data <- do.call( "rbind", data.list )

does not complete - the run time is prohibitive (I killed the R session
after 5 minutes).
I would think that merging data.frames is a common operation. Is
there a better-performing function that I could use?
Thank you.
Witold
---------------------------------------------------------------------------
Jeff Newmiller
Research Engineer (Solar/Batteries/Software/Embedded Controllers)
DCN:<jdnew...@dcn.davis.ca.us>