[R] Significant performance difference between split of a data.frame and split of vectors

Peng Yu Tue, 08 Dec 2009 20:28:59 -0800

I have the following code, which tests split on a data.frame and
split on each column (as a vector) separately. The runtimes differ by
a factor of about 10. As m and k increase, the difference becomes even
bigger.


I'm wondering why the performance on data.frame is so bad. Is it a bug
in R? Can it be improved?

> system.time(split(as.data.frame(x),f))
   user  system elapsed
  1.700   0.010   1.786
>
> system.time(lapply(
+         1:dim(x)[[2]]
+         , function(i) {
+           split(x[,i],f)
+         }
+         )
+     )
   user  system elapsed
  0.170   0.000   0.167

###########
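m=30000  # number of rows
n=6      # number of columns
k=3000   # group labels are drawn from 1:k

set.seed(0)
x=replicate(n,rnorm(m))  # m x n numeric matrix
f=sample(1:k, size=m, replace=TRUE)  # grouping vector of length m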

system.time(split(as.data.frame(x),f))

system.time(lapply(
        1:dim(x)[[2]]
        , function(i) {
          split(x[,i],f)
        }
        )
    )
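
For comparison, here is a minimal sketch of a third variant that might help narrow down where the time goes. It assumes (not verified against any particular R version) that most of the cost in the data.frame case comes from subsetting the data.frame once per group, so it splits the row indices once and subsets the plain matrix instead. The names idx and by_group are hypothetical and not part of the test above.

```r
# Same setup as in the test above
m <- 30000
n <- 6
k <- 3000

set.seed(0)
x <- replicate(n, rnorm(m))
f <- sample(1:k, size = m, replace = TRUE)

# Split the row indices once (a cheap split of an integer vector) ...
idx <- split(seq_len(nrow(x)), f)

# ... then pull the matching rows out of the matrix for each group.
# drop = FALSE keeps single-row groups as 1 x n matrices.
by_group <- lapply(idx, function(i) x[i, , drop = FALSE])

# Timing of the same two steps, for comparison with the runs above
system.time(lapply(split(seq_len(nrow(x)), f),
                   function(i) x[i, , drop = FALSE]))
```

Each element of by_group is a matrix with the rows of one group, so the result is a list of matrices rather than a list of data.frames. Whether that is an acceptable substitute depends on what is done with the pieces afterwards.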
