Please see the following code for the runtime comparison between split() and mysplit.data.frame() (they do the same thing semantically). mysplit.data.frame() is a fix of split() in term of performance. Could somebody include this fix (with possible checking for corner cases) in future version of R and let me know the inclusion of the fix?
m=300000 n=6 k=30000 set.seed(0) x=replicate(n,rnorm(m)) f=sample(1:k, size=m, replace=T) mysplit.data.frame<-function(x,f) { print('processing data.frame') v=lapply( 1:dim(x)[[2]] , function(i) { split(x[,i],f) } ) w=lapply( seq(along=v[[1]]) , function(i) { result=do.call( cbind , lapply(v, function(vj) { vj[[i]] } ) ) colnames(result)=colnames(x) return(result) } ) names(w)=names(v[[1]]) return(w) } system.time(split(as.data.frame(x),f)) system.time(mysplit.data.frame(as.data.frame(x),f)) ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel