Paul,
There is another group of functions that are similar to do.call() in
their serial application of a function to a list or vector. They are
somewhat more tolerant in that dyadic (binary) operators can be used
as the function argument, whereas do.call() really just expands its
second argument into a single call. The one that is _most_ similar is
Reduce():
?Reduce
A somewhat smaller example than ours...
> df1<- data.frame(x=rnorm(5),y=rnorm(5))
> df2<- data.frame(x=rnorm(5),y=rnorm(5))
> df3<- data.frame(x=rnorm(5),y=rnorm(5))
> df4<- data.frame(x=rnorm(5),y=rnorm(5))
>
> mylist<- list(df1, df2, df3, df4)
> Reduce("rbind", mylist)
x y
1 -0.40175483 -0.96187409
2 0.76629538 0.92201312
3 2.44535842 0.90634825
4 0.57784258 -2.12756145
5 -1.62083235 -0.96310011
6 0.02625574 1.17684408
7 1.52412427 -0.26432372
<snipped remaining rows>
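Reduce() folds from the left by default, so with four frames that call
is the pairwise equivalent of:
rbind(rbind(rbind(df1, df2), df3), df4)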
> do.call("+", list(1:3))
[1] 1 2 3
> do.call("+", list(a=1:3, b=3:5))
[1] 4 6 8
> do.call("+", list(a=1:3, b=3:5, cc=7:9))
Error in `+`(a = 1:3, b = 3:5, cc = 7:9) :
operator needs one or two arguments
> Reduce("+", list(a=1:3, b=3:5, cc=7:9))
[1] 11 14 17
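That error is just `+` being a binary operator: do.call() hands it all
three vectors in one call, while Reduce() only ever applies it to two
things at a time, the equivalent of:
> (1:3 + 3:5) + 7:9
[1] 11 14 17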
Reduce() can also "accumulate" its intermediate results:
> Reduce("+", 1:10)
[1] 55
> Reduce("+", 1:10, accumulate=TRUE)
[1] 1 3 6 10 15 21 28 36 45 55
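For arithmetic, the accumulated fold matches cumsum(), a handy
cross-check:
> all(Reduce("+", 1:10, accumulate=TRUE) == cumsum(1:10))
[1] TRUE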
On Sep 4, 2010, at 4:41 PM, Joshua Wiley wrote:
To echo what Erik said, the second argument of do.call(), args, takes
a list of arguments that it passes to the specified function. Since
rbind() can bind any number of data frames, all the data frames in
mylist are rbind()ed in a single call.
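A quick way to watch the expansion is a function with a named
argument; a named list element is passed as that argument:
> do.call("paste", list("a", "b", sep="-"))
[1] "a-b"
which is exactly paste("a", "b", sep="-").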
These two calls should take about the same time (except for time
saved typing):
rbind(mylist[[1]], mylist[[2]], mylist[[3]], mylist[[4]]) # 1
do.call("rbind", mylist) # 2
On my system using:
set.seed(1)
dat <- rnorm(10^6)
df1 <- data.frame(x=dat, y=dat)
mylist <- list(df1, df1, df1, df1)
They do take about the same time (I started two instances of R and ran
both calls, but switched the order, because R has a way of being
faster the second time you do the same thing).
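The timing wrappers aren't shown above; presumably each call was
wrapped in system.time(), something like:
system.time(rbind(mylist[[1]], mylist[[2]], mylist[[3]], mylist[[4]]))  # 1
system.time(do.call("rbind", mylist))                                   # 2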
[1] "Order: 1, 2"
user system elapsed
0.60 0.14 0.75
user system elapsed
0.41 0.14 0.54
[1] "Order: 2, 1"
user system elapsed
0.56 0.21 0.76
user system elapsed
0.41 0.14 0.55
Using the for loop is much slower in your later example because
rbind() is called over and over, and the object holding your results
grows incrementally, so each iteration copies every row accumulated so
far.
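A back-of-the-envelope model shows how badly that scales: with k
pieces of n rows each, iteration j copies all j*n rows built so far,
so (using the sizes from your later example):
n <- 1000; k <- 1000   # piece size and piece count
sum((1:k) * n)         # loop copies ~5.0e8 rows in total
n * k                  # one do.call("rbind", ...) writes 1e6 rows once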
You wrote: "Often it happens that there is a list with lots of
matrices or data frames in it and we need to 'stack those together'."
For my own curiosity, are you reading in a bunch of separate data
files or are these the results of various operations that you
eventually want to combine?
Cheers,
Josh
On Sat, Sep 4, 2010 at 11:37 AM, Paul Johnson <pauljoh...@gmail.com>
wrote:
I've been doing some consulting with students who seem to come to R
from SAS. They are usually preoccupied with do loops, and it is tough
to persuade them to trust R lists rather than keeping hundreds of
named matrices floating around.
Often it happens that there is a list with lots of matrices or data
frames in it and we need to "stack those together". I thought it
would be a simple thing, but it turns out there are several ways to
get it done, and in this case the most "elegant" way, using do.call,
is not the fastest, but it does appear to be the least prone to
programmer error.
I have been staring at ?do.call for quite a while, and I have to admit
that I just need some more explanation in order to interpret it. I
can't really see why this works:
do.call("rbind", mylist)
but this does not:
sapply(mylist, rbind)
Anyway, here's a self-contained working example that compares the
speed of various approaches. If you send yet more ways to do this, I
will add them and then post the result to my Working Example
collection.
## stackMerge.R
## Paul Johnson <pauljohn at ku.edu>
## 2010-09-02
## rbind is neat, but how do you apply it to a lot of
## data frames?
## Here is a test case
df1 <- data.frame(x=rnorm(100),y=rnorm(100))
df2 <- data.frame(x=rnorm(100),y=rnorm(100))
df3 <- data.frame(x=rnorm(100),y=rnorm(100))
df4 <- data.frame(x=rnorm(100),y=rnorm(100))
mylist <- list(df1, df2, df3, df4)
## Usually we have done a stupid
## loop to get this done
resultDF <- mylist[[1]]
for (i in 2:4) resultDF <- rbind(resultDF, mylist[[i]])
## My intuition was that this should work:
## lapply(mylist, rbind)
## but no! It just makes a new list.
## And this obliterates the columns:
## unlist(mylist)
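## To see why lapply() makes a list instead: rbind() with a single
## argument just returns that argument, so lapply() hands rbind() one
## frame at a time, while do.call() hands over all four in one call:
## lapply(mylist, rbind)    # list of 4 unchanged data frames
## do.call("rbind", mylist) # one 400-row data frame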
## I got this idea from code in the
## "complete" function in the "mice" package
## It uses brute force to allocate a big matrix of 0's and
## then it places the individual data frames into that matrix.
m <- 4
nr <- nrow(df1)
nc <- ncol(df1)
dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
## I searched a long time for an answer that looked better.
## This website is helpful:
## http://stackoverflow.com/questions/tagged/r
## I started to type in the question and 3 plausible answers
## popped up before I could finish.
## The terse answer is:
shortAnswer <- do.call("rbind", mylist)
## That's the right answer, see:
shortAnswer == dataComplete
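## A tidier check than eyeballing the element-wise comparison:
all(shortAnswer == dataComplete)
## (identical() would be FALSE only because the brute-force frame
## has the default V1/V2 column names.)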
## But I don't understand why it works.
## More importantly, I don't know if it is fastest, or best.
## It is certainly less error prone than "dataComplete"
## First, make a bigger test case and use system.time to evaluate
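## (phony() ignores its argument; i is just an index so lapply() can
## call it 1000 times)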
phony <- function(i){
data.frame(w=rnorm(1000), x=rnorm(1000),y=rnorm(1000),z=rnorm(1000))
}
mylist <- lapply(1:1000, phony)
### First, try the terse way
system.time( shortAnswer <- do.call("rbind", mylist) )
### Second, try the complete way:
m <- 1000
nr <- nrow(mylist[[1]])  ## each phony() frame has 1000 rows, not nrow(df1)=100
nc <- ncol(mylist[[1]])  ## and 4 columns (w, x, y, z), not ncol(df1)=2
system.time(
dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
)
system.time(
for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
)
## On my Thinkpad T62 dual core, the "shortAnswer" approach takes
## about three times as long:
## > system.time( shortAnswer <- do.call("rbind", mylist) )
## user system elapsed
## 14.270 1.170 15.433
## > system.time(
## +   dataComplete <- as.data.frame(matrix(0, nrow = nr*m, ncol = nc))
## + )
## user system elapsed
## 0.000 0.000 0.006
## > system.time(
## +   for (j in 1:m) dataComplete[(((j-1)*nr) + 1):(j*nr), ] <- mylist[[j]]
## + )
## user system elapsed
## 4.940 0.050 4.989
## That makes the do.call way look slow, and I said "hey,
## our stupid for loop at the beginning may not be so bad."
## Wrong. It is a disaster. Check this out:
## > resultDF <- phony(1)
## > system.time(
## + for (i in 2:1000) resultDF <- rbind(resultDF, mylist[[i]])
## + )
## user system elapsed
## 159.740 4.150 163.996
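## If you do want a loop, the fix is to grow a list (cheap) rather
## than rbind-ing a data frame each pass (expensive), and combine once
## at the end. An untimed sketch:
## resultList <- vector("list", 1000)
## for (i in 1:1000) resultList[[i]] <- mylist[[i]]
## resultDF <- do.call("rbind", resultList)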
--
Paul E. Johnson
Professor, Political Science
1541 Lilac Lane, Room 504
University of Kansas
--
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/
David Winsemius, MD
West Hartford, CT