Sorry. I sent this to r-help by mistake. Could somebody help delete it from the archive?
On Wed, Dec 9, 2009 at 9:29 PM, Peng Yu <pengyu...@gmail.com> wrote:
> I make a version for matrix. Because, it would be more efficient to
> split each column of a matrix than to convert a matrix to a data.frame
> then call split() on the data.frame. Note that the version for a
> matrix and a data.frame is slightly different. Would somebody add this
> in R as well?
>
> split.matrix<-function(x,f) {
>   #print('processing matrix')
>   v=lapply(
>       1:dim(x)[[2]]
>       , function(i) {
>         base:::split.default(x[,i],f)#the difference is here
>       }
>       )
>
>   w=lapply(
>       seq(along=v[[1]])
>       , function(i) {
>         result=do.call(
>             cbind
>             , lapply(v,
>                 function(vj) {
>                   vj[[i]]
>                 }
>                 )
>             )
>         colnames(result)=colnames(x)
>         return(result)
>       }
>       )
>   names(w)=names(v[[1]])
>   return(w)
> }
>
>
> On Wed, Dec 9, 2009 at 5:44 PM, Charles C. Berry <cbe...@tajo.ucsd.edu> wrote:
>> On Wed, 9 Dec 2009, William Dunlap wrote:
>>
>>> Here are some differences between the current and proposed
>>> split.data.frame.
>>
>> Adding 'drop=FALSE' fixes this case. See in line correction below.
>>
>> Chuck
>>
>>>
>>>> d<-data.frame(Matrix=I(matrix(1:10, ncol=2)),
>>>
>>> Named=c(one=1,two=2,three=3,four=4,five=5),
>>> row.names=as.character(1001:1005))
>>>>
>>>> group<-c("A","B","A","A","B")
>>>> split.data.frame(d,group)
>>>
>>> $A
>>>      Matrix.1 Matrix.2 Named
>>> 1001        1        6     1
>>> 1003        3        8     3
>>> 1004        4        9     4
>>>
>>> $B
>>>      Matrix.1 Matrix.2 Named
>>> 1002        2        7     2
>>> 1005        5       10     5
>>>
>>>> mysplit.data.frame(d,group) # lost row.names and 2nd column of Matrix
>>>
>>> [1] "processing data.frame"
>>> $A
>>>      Matrix Named
>>> [1,]      1     1
>>> [2,]      3     3
>>> [3,]      4     4
>>>
>>> $B
>>>      Matrix Named
>>> [1,]      2     2
>>> [2,]      5     5
>>>
>>>
>>> Bill Dunlap
>>> Spotfire, TIBCO Software
>>> wdunlap tibco.com
>>>
>>>> -----Original Message-----
>>>> From: r-devel-boun...@r-project.org
>>>> [mailto:r-devel-boun...@r-project.org] On Behalf Of
>>>> pengyu...@gmail.com
>>>> Sent: Wednesday, December 09, 2009 2:10 PM
>>>> To: r-de...@stat.math.ethz.ch
>>>> Cc: r-b...@r-project.org
>>>> Subject: [Rd] split() is slow on data.frame (PR#14123)
>>>>
>>>> Please see the following code for the runtime comparison between
>>>> split() and mysplit.data.frame() (they do the same thing
>>>> semantically). mysplit.data.frame() is a fix of split() in term of
>>>> performance. Could somebody include this fix (with possible checking
>>>> for corner cases) in future version of R and let me know the inclusion
>>>> of the fix?
>>>>
>>>> m=300000
>>>> n=6
>>>> k=30000
>>>>
>>>> set.seed(0)
>>>> x=replicate(n,rnorm(m))
>>>> f=sample(1:k, size=m, replace=T)
>>>>
>>>> mysplit.data.frame<-function(x,f) {
>>>>   print('processing data.frame')
>>>>   v=lapply(
>>>>       1:dim(x)[[2]]
>>>>       , function(i) {
>>>>         split(x[,i],f)
>>
>> Change to:
>>
>> split(x[,i,drop=FALSE],f)
>>
>>
>>>>       }
>>>>       )
>>>>
>>>>   w=lapply(
>>>>       seq(along=v[[1]])
>>>>       , function(i) {
>>>>         result=do.call(
>>>>             cbind
>>>>             , lapply(v,
>>>>                 function(vj) {
>>>>                   vj[[i]]
>>>>                 }
>>>>                 )
>>>>             )
>>>>         colnames(result)=colnames(x)
>>>>         return(result)
>>>>       }
>>>>       )
>>>>   names(w)=names(v[[1]])
>>>>   return(w)
>>>> }
>>>>
>>>> system.time(split(as.data.frame(x),f))
>>>> system.time(mysplit.data.frame(as.data.frame(x),f))
>>>>
>>>> ______________________________________________
>>>> r-de...@r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>>
>>>
>>> ______________________________________________
>>> r-de...@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>
>> Charles C. Berry (858) 534-2098
>> Dept of Family/Preventive
>> Medicine
>> E mailto:cbe...@tajo.ucsd.edu UC San Diego
>> http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
>>
>>
>>
>
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
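
For the matrix case raised in the quoted thread, one alternative to splitting column by column is to split the row indices once and then subset the matrix per group with drop=FALSE, so that row names, column names and one-row groups all survive. This is a different technique from the split.matrix() quoted above, offered only as a minimal sketch: the name split_matrix_rows is hypothetical and not part of R, the example data are made up for illustration, and no timing claim is made for it.

```r
# Hypothetical helper (not part of base R): split the rows of a matrix
# into a list of sub-matrices, one per level of the grouping vector f.
split_matrix_rows <- function(x, f) {
  stopifnot(is.matrix(x), length(f) == nrow(x))
  # Split the row indices once instead of splitting every column.
  idx <- split(seq_len(nrow(x)), f)
  # drop = FALSE keeps each piece a matrix and preserves dimnames,
  # even when a group contains a single row.
  lapply(idx, function(i) x[i, , drop = FALSE])
}

# Illustrative data, modelled on the example in the quoted message.
x <- matrix(1:10, ncol = 2,
            dimnames = list(as.character(1001:1005), c("a", "b")))
group <- c("A", "B", "A", "A", "B")

res <- split_matrix_rows(x, group)
print(res)
```

Because each piece is taken with x[i, , drop = FALSE], the row names 1001 to 1005 and both columns are carried into the result, which is the behaviour the 'drop=FALSE' correction in the thread is concerned with.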