On Tue, Aug 12, 2008 at 04:47:14AM -0400, Michael R. Head wrote:
> I have a collection of datasets in separate data frames which have 3
> independent test parameters (w, x, y) and one dependent variable (z),
> together with some additional static test data on each row. What I want
> is a data frame which contains the test data, the parameters (w, x, y)
> and the mean value of all (z)s in the Z column.
>
> Each dataset has around 6000 rows and around 7 columns, which doesn't
> seem outrageously large, so it seems like this shouldn't be too time
> consuming, but the way I've been approaching it seems to take way too
> long (20 seconds for datasets over 4 runs, longer for my datasets over
> 10 runs).
>
> My imperative-coding brain led me to use for loops, which seem to be
> particularly problematic for R performance. My first attempt at this
> looked like the following, which takes roughly 60 seconds to complete. I
> rewrote it a little, but the code was much longer and effectively
> replaces one of the for loops with an lapply(). I could paste the other
> code, but it's much longer and less clear about its intent.
>
Hi Michael,

> #######################
> # Start code snippet
> #######################
> ### inputFiles just a list of paths to the test runs
> testRuns <- lapply(inputFiles,
>                    function(x) {
>                        read.table(x, header=TRUE)})

(Just BTW lapply(inputFiles, read.table, header=TRUE) is slightly nicer
to look at)

>
> ### W, X, Y have (small) natural values
> w <- unique(testRuns[[1]]$W)
> x <- unique(testRuns[[1]]$X)
> y <- unique(testRuns[[1]]$Y)
>
> ### All runs have the same values for all columns
> ### with the exception of the Z values, so just
> ### copy the first test run data
> testMeans <- data.frame(testRuns[[1]])

How about rbind()ing all the data frames together, and working with the
combined data frame? Say that testRuns is

> testRuns
[[1]]
  W X Y          Z
1 1 5 5 -0.5251156
2 5 1 3  1.1761139
3 2 4 4 -0.8934380
4 5 1 1  1.4076303
5 5 3 1  0.4679745

[[2]]
  W X Y          Z
1 1 5 5 -0.8556862
2 5 1 3  0.3517671
3 2 4 4 -1.0202064
4 5 1 1  1.2152349
5 5 3 1  0.4340249

> allRuns <- do.call("rbind", testRuns)
> aggregate(allRuns$Z, by=allRuns[c("W","X","Y")], mean)
  W X Y          x
1 5 1 1  1.3114326
2 5 3 1  0.4509997
3 5 1 3  0.7639405
4 2 4 4 -0.9568222
5 1 5 5 -0.6904009

Dan

> for(w0 in w) {
>   for(y0 in y) {
>     for (x0 in x) {
>       row <- which(testMeans$W == w0 &
>                    testMeans$Y == y0 &
>                    testMeans$X == x0)
>       meanValues <- sapply(testRuns,
>                            function(r)
>                            {mean( subset(r,
>                                          r$W == w0 &
>                                          r$Y == y0 &
>                                          r$X == x0)$Z )})
>       testMeans[row,]$Z = mean(meanValues)
>     }
>   }
> }
> ### I will then want to plot certain values over (X, Z),
> ### so ultimately, I'm going to subset the data further.
> ### Code which gives me a list of W tables with mean Z values
> ### works, too.
> #######################
> # End code snippet
> #######################
>
>
> Thanks,
> mike
>
> --
> Michael R. Head <[EMAIL PROTECTED]>
> http://www.cs.binghamton.edu/~mike/
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
www.stats.ox.ac.uk/~davison
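
For anyone who wants to try the rbind()/aggregate() suggestion without the
original files, here is a minimal self-contained sketch. The data are made up
for illustration (the names params and byW are not from the thread), and it
uses the formula interface of aggregate() so that the result column keeps the
name Z. It assumes, as stated in the question, that every run contains the
same (W, X, Y) rows; under that assumption the pooled mean equals the mean of
the per-run means.

```r
# Hypothetical stand-in data: two runs sharing (W, X, Y), differing only in Z.
params <- data.frame(W = c(1, 5, 2), X = c(5, 1, 4), Y = c(5, 3, 4))
testRuns <- list(
  cbind(params, Z = c(1, 2, 3)),
  cbind(params, Z = c(3, 6, 5))
)

# Stack all runs into a single data frame.
allRuns <- do.call("rbind", testRuns)

# Mean of Z within each (W, X, Y) combination; the result has
# columns W, X, Y and Z.
testMeans <- aggregate(Z ~ W + X + Y, data = allRuns, FUN = mean)

# Optional: one table of mean Z values per value of W.
byW <- split(testMeans, testMeans$W)
```

If the runs did not all contain the same rows, averaging per run first and
then across runs could give a different answer from the pooled mean, so it is
worth checking that assumption on the real data.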