> [R] Yet another set of codes to optimize
> Daren Tan daren76 at hotmail.com
> Fri Dec 5 03:41:23 CET 2008
>
> I have problems converting my dataset from long to wide format. Previous
> attempts using reshape package and aggregate function were unsuccessful
> as they took too long. Apparently, my simplified solution also lasted
> as long.
>
> My complete codes is given below. When sample.size = 10000, the
> execution takes about 20 seconds. But sample.size = 100000 seems to take
> eternity. My actual sample.size is 15000000 i.e. 15 million.
>
> sample.size <- 10000
>
> m <- data.frame(Name=sample(1:100000, sample.size, T), Type=sample(1:1000,
> sample.size, T), Predictor=sample(LETTERS[1:10], sample.size, T))
>
> res <- function(m) {
>    m.12.unique <- unique(m[,1:2])
>    m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
>    v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
>    v2 <- c(sort(unique(m[,3])))
>    res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
>    m.ids <- paste(m[,1], m[,2], sep=".")
>    for(i in 1:nrow(m)) {
>       x <- m.ids[i]
>       y <- m[i,3]
>       res[x, y] <- res[x, y] + 1
>    }
>    res <- data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)
>    colnames(res) <- c("Name", "Type", v2)
>    return(res)
> }
>
> res(m)

Your for loop is tabulating the items in m.ids and m[,3], so think of
using table(). E.g., replace

   res <- matrix(0, nr=length(v1), nc=length(v2), dimnames=list(v1, v2))
   for(i in 1:nrow(m)) {
      x <- m.ids[i]
      y <- m[i,3]
      res[x, y] <- res[x, y] + 1
   }

with

   res <- table(factor(m.ids, levels=v1), factor(m[,3]))

There is a bit of trickiness in putting this table into the data.frame.
as.data.frame(tableObject) works very differently than
as.data.frame(matrixObject): it returns one row per cell of the table
instead of one column per table column. So the naive

   data.frame(m.12.unique[,1], m.12.unique[,2], res, row.names=NULL)

fails. You need to convert the table res into a matrix with the same
data, dimensions, and dimnames.

   data.frame(m.12.unique[,1], m.12.unique[,2], as.matrix(res), row.names=NULL)

also fails, because a "table" object is already a "matrix" object, so
as.matrix(tableObject) returns its input unchanged. as(res, "matrix")
seems to work, as does the wordier but more explicit
array(res, dim(res), dimnames(res)).

   res1 <- function(m) {
      # One row per distinct (Name, Type) pair, sorted by Name then Type
      m.12.unique <- unique(m[,1:2])
      m.12.unique <- m.12.unique[order(m.12.unique[,1], m.12.unique[,2]),]
      # Row keys ("Name.Type") and column labels (predictor values)
      v1 <- paste(m.12.unique[,1], m.12.unique[,2], sep=".")
      v2 <- c(sort(unique(m[,3])))
      m.ids <- paste(m[,1], m[,2], sep=".")
      # Cross-tabulate in one call; this replaces both the preallocated
      # matrix and the for loop
      res <- table(factor(m.ids, levels=v1), factor(m[,3]))
      # Drop the "table" class so data.frame() treats res as a matrix
      res <- data.frame(m.12.unique[,1], m.12.unique[,2],
                        as(res, "matrix"), row.names=NULL)
      colnames(res) <- c("Name", "Type", v2)
      return(res)
   }

Here is a table of times for your original function, time0, and this
modified one, time1. It looks like res1 eventually becomes worse than
linear, but at a much larger size than your original does. sort() and
unique() are not guaranteed to run in linear time, so they may be
starting to matter at size=1e6.

      size   time0  time1
1       10   0.012  0.012
2      100   0.032  0.014
3      200   0.061  0.016
4      400   0.126  0.020
5      800   0.286  0.028
6     1000   0.383  0.033
7     2000   2.337  0.054
8     4000   8.578  0.100
9     8000  39.955  0.214
10   10000  68.767  0.318
11   20000 327.973  1.057
12  100000      NA  3.021
13 1000000      NA 89.881

Bill Dunlap
TIBCO Software Inc - Spotfire Division
wdunlap tibco.com
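
P.S. The replacement of the loop by table() described above can be checked on a small made-up data set. What follows is only a sketch under stated assumptions, not the code that produced the timings: the data sizes and the names keys, by.loop, tab, by.table, long.form and wide are chosen here for illustration, and factor(m[,3], levels=v2) is used so that the column order is explicit. It counts the same pairs both ways, converts the table with array(), and compares the results.

```r
# Small made-up data set in the same long format as the original post.
set.seed(1)
n <- 200
m <- data.frame(Name = sample(1:5, n, TRUE),
                Type = sample(1:3, n, TRUE),
                Predictor = sample(LETTERS[1:4], n, TRUE),
                stringsAsFactors = FALSE)

# Row keys and column labels, built the same way as in the post.
keys <- unique(m[, 1:2])
keys <- keys[order(keys[, 1], keys[, 2]), ]
v1 <- paste(keys[, 1], keys[, 2], sep = ".")
v2 <- sort(unique(m[, 3]))
m.ids <- paste(m[, 1], m[, 2], sep = ".")

# Loop version: one matrix update per input row.
by.loop <- matrix(0, nrow = length(v1), ncol = length(v2),
                  dimnames = list(v1, v2))
for (i in seq_len(nrow(m))) {
  x <- m.ids[i]
  y <- m[i, 3]
  by.loop[x, y] <- by.loop[x, y] + 1
}

# table() version: a single cross-tabulation of the two key vectors.
tab <- table(factor(m.ids, levels = v1), factor(m[, 3], levels = v2))

# as.data.frame() on a table gives the long layout, one row per cell,
# which is why the table cannot be passed to data.frame() directly.
long.form <- as.data.frame(tab)

# Drop the "table" class but keep the data, dim and dimnames.
by.table <- array(tab, dim(tab), dimnames(tab))

# Both approaches should produce identical counts.
same.counts <- all(by.loop == by.table)

# Wide result: one row per (Name, Type) pair, one column per predictor.
wide <- data.frame(Name = keys[, 1], Type = keys[, 2], by.table,
                   row.names = NULL)
```

If same.counts is TRUE, the two approaches agree on this input, and wide has one row per (Name, Type) pair with one count column per predictor value. The loop is kept here only as a reference for the comparison; it is the part that becomes slow as nrow(m) grows.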