If your Z in reality is not naturally numeric try representing it as a factor and using the numeric levels as your numbers and then put the level labels back on:
m <- n <- 5 DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = letters[1:25]) Zn <- as.numeric(DF$Z) system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide")) system.time({Zn <- as.numeric(DF$Z) w2 <- xtabs(Zn ~ Y + X, DF) w2[w2 > 0] <- levels(DF$Z)[w2] w2[w2 == 0] <- NA }) On 8/24/06, Gabor Grothendieck <[EMAIL PROTECTED]> wrote: > If for each combination of X and Y there is at most one Z and Z is numeric > (it could be made so in your example) then you could use xtabs which is > faster: > > > m <- n <- 10 > > DF <- data.frame(X = gl(m*n, 1), Y = gl(m, n), Z = 10*(1:(n*m))) > > system.time(w1 <- reshape(DF, timevar = "X", idvar = "Y", dir = "wide")) > [1] 0.33 0.00 0.33 NA NA > > system.time(w2 <- xtabs(Z ~ Y + X, DF)) > [1] 0.01 0.00 0.01 NA NA > > On 8/24/06, Mitch Skinner <[EMAIL PROTECTED]> wrote: > > Hello, > > > > I'm mailing r-devel because I think the performance problem I'm having > > is best solved by re-writing reshape() in C. I've been reading the > > "writing R extensions" documentation and I have some questions about > > they best way to write the C bits, but first let me describe my problem: > > > > I'm trying to reshape a long data frame with ~70 subjects measured at > > ~4500 "times" (in my case, it's ~4500 genetic loci on a chromosome) into > > a wide data frame with one column per locus ("time"). On my data > > (~300,000 rows for chromosome 1) it takes about twenty minutes on a > > 3.4GHz P4. Here's an R session that demonstrates it (this is pretty > > close to how my data actually looks): > > > > > version > > _ > > platform i686-redhat-linux-gnu > > arch i686 > > os linux-gnu > > system i686, linux-gnu > > status > > major 2 > > minor 3.1 > > year 2006 > > month 06 > > day 01 > > svn rev 38247 > > language R > > version.string Version 2.3.1 (2006-06-01) > > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + > > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), > > genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > > > system.time(testWide <- reshape(test, v.names=c("genotype"), > > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > > [1] 1107.506 120.755 1267.568 0.000 0.000 > > > > I believe that this can be done a lot faster, and I think the problem > > comes from this loop in reshape.R (in reshapeWide): > > > > for(i in seq(length = length(times))) { > > thistime <- data[data[,timevar] %in% times[i],] > > rval[,varying[,i]] <- > > thistime[match(rval[,idvar],thistime[,idvar]), > > v.names] > > } > > > > I don't really understand everything going on under the hood here, but I > > believe the complexity of this loop is something like > > O(length(times)*(nrow(data)+(nrow(rval)*length(varying))). The profile > > shows the lion's share (90%) of the time being spent in [.data.frame. > > > > What I'd like to do is write a C loop to go through the source (long) > > data frame once (rather than once per time), and put the values into the > > destination rows/columns using hash tables to look up the right > > row/column. > > > > Assuming the hashing is constant-time bounded, then the reshape becomes > > O(nrow(data)*length(varying)). > > > > I'd like to use the abitrary-R-vector hashing functionality from > > unique.c, but I'm not sure how to do it. I think that functionality is > > useful to other C code, but the functions that I'm interested in are not > > part of the R api (they're defined with "static"). Assuming copy/paste > > is out, I can see two options: 1. to remove "static" from the > > declarations of some of the functions, and add prototypes for those > > functions to a new src/library/stats/src/reshape.c file (or to one of > > the header files in src/include), or 2. to add C functions to do the > > reshaping to src/main/unique.c and call those from > > src/library/stats/R/reshape.R. > > > > This is all assuming that it makes sense to include this in mainline > > R--obviously I think it's worthwhile, and I'm surprised other people > > aren't complaining more. I would be happy to write/rewrite until y'all > > are happy with how it's done, of course. > > > > I've written a proof-of-concept by copying and pasting the hashing > > functions, which (on the above data frame) runs 20 times faster than the > > R version of reshape. It still needs some debugging, to put it mildly > > (doesn't work properly on reals), but the basic idea appears to work. > > > > The change I made to the reshape R function looks like this: > > ===================== > > for(i in seq(length = length(times))) { > > - thistime <- data[data[,timevar] %in% times[i],] > > - rval[,varying[,i]] <- > > thistime[match(rval[,idvar],thistime[,idvar]), > > - v.names] > > + for (j in seq(length(v.names))) { > > + rval[,varying[j,i]] <- > > rep(vector(mode=typeof(data[[v.names[j]]]), 0), > > + length.out=nrow(rval)) > > + } > > } > > > > + colMap <- match(varying, names(rval)) > > + .Call("do_reshape_wide", > > + data[[idvar]], data[[timevar]], data[v.names], > > + rval, colMap, > > + v.names, times, rval[[idvar]]) > > + > > if (!is.null(new.row.names)) > > row.names(rval) <- new.row.names > > ===================== > > > > This part: > > rep(vector(mode=typeof(data[[v.names[j]]]), 0), length.out=nrow(rval)) > > is to initialize the output with appropriately-typed vectors full of > > NAs; if there's a better/right way to do this please let me know. > > > > The do_reshape_wide C function looks like this: > > ===================== > > SEXP do_reshape_wide(SEXP longIds, SEXP longTimes, SEXP longData, > > SEXP wideFrame, SEXP colMap, > > SEXP vnames, SEXP times, SEXP wideIds) { > > HashData idHash, timeHash; > > int longRows, numVarying; > > int rowNum, varying; > > int wideRow, curTime; > > SEXP wideCol, longCol; > > > > HashTableSetup(wideIds, &idHash); > > PROTECT(idHash.HashTable); > > DoHashing(wideIds, &idHash); > > > > HashTableSetup(times, &timeHash); > > PROTECT(timeHash.HashTable); > > DoHashing(times, &timeHash); > > > > longRows = length(longIds); > > numVarying = length(vnames); > > > > for (rowNum = 0; rowNum < longRows; rowNum++) { > > /* Lookup returns 1-based answers */ > > wideRow = Lookup(wideIds, longIds, rowNum, &idHash) - 1; > > curTime = Lookup(times, longTimes, rowNum, &timeHash) - 1; > > > > for (varying = 0; varying < numVarying; varying++) { > > /* colMap is 1-based */ > > wideCol = VECTOR_ELT(wideFrame, > > INTEGER(colMap)[(numVarying * curTime) > > + varying] - 1); > > longCol = VECTOR_ELT(longData, varying); > > > > SET_VECTOR_ELT(wideCol, wideRow, VECTOR_ELT(longCol, > > rowNum)); > > } > > } > > UNPROTECT(2); > > } > > ===================== > > > > None examples I recall from "Writing R extensions" had void C function > > return types; is that a rule? > > > > I've put the code up here: > > http://arctur.us/r/creshape.tar.gz > > > > I've never tried to make a package before, but you should be able to > > just do a R CMD INSTALL on it, if you want to try it out. Here's an R > > session using that code: > > > require(creshape) > > Loading required package: creshape > > [1] TRUE > > > test=data.frame(subject=I(as.character(as.integer(runif(3e5, 1, 70) + > > 1000))), locus=I(as.character(as.integer(runif(3e5, 1, 4500) + 1e6))), > > genotype=I(as.character(as.integer(runif(3e5, 1, 100))))) > > > system.time(testWide <- creshape(test, v.names=c("genotype"), > > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > > [1] 60.756 1.540 62.598 0.000 0.000 > > > system.time(testWide <- reshape(test, v.names=c("genotype"), > > timevar="locus", idvar="subject", direction="wide"), gcFirst=TRUE) > > [1] 1278.231 78.389 1387.739 0.000 0.000 > > > > Any comments are appreciated, > > Mitch > > > > ______________________________________________ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > ______________________________________________ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel