I made a version for matrices, because it would be more efficient to split each column of a matrix directly than to convert the matrix to a data.frame and then call split() on the data.frame. Note that the matrix version differs slightly from the data.frame version. Could somebody add this to R as well?
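For reference, here is a minimal sketch of the result a matrix split is meant to produce, computed a different way: split the row indices once, then subset the matrix with drop=FALSE. The name split_rows_by_index is hypothetical and not part of base R; it is meant only as a cross-check, not as the proposed implementation.

```r
# Hypothetical helper (not part of base R): split the rows of a matrix by f.
# The row indices are split once, then each group is extracted with
# drop = FALSE so that single-row groups stay matrices.
split_rows_by_index <- function(x, f) {
  stopifnot(is.matrix(x))
  idx <- split(seq_len(nrow(x)), f)
  lapply(idx, function(rows) x[rows, , drop = FALSE])
}

# Small example: 5 rows, 2 columns, two groups.
m <- matrix(1:10, ncol = 2, dimnames = list(NULL, c("a", "b")))
group <- c("A", "B", "A", "A", "B")
parts <- split_rows_by_index(m, group)
print(parts$A)  # rows 1, 3 and 4 of m
print(parts$B)  # rows 2 and 5 of m
```

Comparing this with split.matrix() on the same inputs is a quick sanity check. The sketch makes no claim about which of the two is faster on large inputs.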

split.matrix <- function(x, f) {
  # print('processing matrix')
  v <- lapply(seq_len(ncol(x)), function(i) {
    base::split.default(x[, i], f)  # the difference is here
  })

  w <- lapply(seq_along(v[[1]]), function(i) {
    result <- do.call(cbind, lapply(v, function(vj) vj[[i]]))
    colnames(result) <- colnames(x)
    result
  })
  names(w) <- names(v[[1]])
  w
}

On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
> On Wed, 9 Dec 2009, William Dunlap wrote:
>
>> Here are some differences between the current and proposed
>> split.data.frame.
>
> Adding 'drop=FALSE' fixes this case. See in line correction below.
>
> Chuck
>
>>
>> > d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
>>       Named=c(one=1,two=2,three=3,four=4,five=5),
>>       row.names=as.character(1001:1005))
>> > group<-c("A","B","A","A","B")
>> > split.data.frame(d,group)
>> $A
>>      Matrix.1 Matrix.2 Named
>> 1001        1        6     1
>> 1003        3        8     3
>> 1004        4        9     4
>>
>> $B
>>      Matrix.1 Matrix.2 Named
>> 1002        2        7     2
>> 1005        5       10     5
>>
>> > mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
>> [1] "processing data.frame"
>> $A
>>      Matrix Named
>> [1,]      1     1
>> [2,]      3     3
>> [3,]      4     4
>>
>> $B
>>      Matrix Named
>> [1,]      2     2
>> [2,]      5     5
>>
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>> -----Original Message-----
>>> From: r-devel-boun...@r-project.org
>>> [mailto:r-devel-boun...@r-project.org] On Behalf Of
>>> pengyu...@gmail.com
>>> Sent: Wednesday, December 09, 2009 2:10 PM
>>> To: r-de...@stat.math.ethz.ch
>>> Cc: r-b...@r-project.org
>>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>>
>>> Please see the following code for the runtime comparison between
>>> split() and mysplit.data.frame() (they do the same thing
>>> semantically). mysplit.data.frame() is a fix of split() in term of
>>> performance. Could somebody include this fix (with possible checking
>>> for corner cases) in future version of R and let me know the inclusion
>>> of the fix?
>>>
>>> m=300000
>>> n=6
>>> k=30000
>>>
>>> set.seed(0)
>>> x=replicate(n,rnorm(m))
>>> f=sample(1:k, size=m, replace=T)
>>>
>>> mysplit.data.frame<-function(x,f) {
>>>   print('processing data.frame')
>>>   v=lapply(
>>>       1:dim(x)[[2]]
>>>       , function(i) {
>>>         split(x[,i],f)
>
> Change to:
>
>           split(x[,i,drop=FALSE],f)
>
>
>>>       }
>>>       )
>>>
>>>   w=lapply(
>>>       seq(along=v[[1]])
>>>       , function(i) {
>>>         result=do.call(
>>>             cbind
>>>             , lapply(v,
>>>                 function(vj) {
>>>                   vj[[i]]
>>>                 }
>>>                 )
>>>             )
>>>         colnames(result)=colnames(x)
>>>         return(result)
>>>       }
>>>       )
>>>   names(w)=names(v[[1]])
>>>   return(w)
>>> }
>>>
>>> system.time(split(as.data.frame(x),f))
>>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>>
>>> ______________________________________________
>>> r-de...@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive Medicine
> E mailto:cbe...@tajo.ucsd.edu               UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901
>

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.