Hi,
I have a big data frame:

> mat <- matrix(rep(paste(letters, collapse=""), 5*300000), ncol=5)
> dat <- as.data.frame(mat)

and I need to do some computation on each row. Currently I'm doing this:

> for (key in row.names(dat)) { row <- dat[key, ]; ... do some computation on row ... }

which could probably be considered a very natural (and R'ish) way of doing it (but maybe I'm wrong and the real idiom for doing this is something different). The problem with this "idiomatic form" is that it is _very_ slow: the loop itself plus the simple extraction of the rows (no computation on the rows at all) takes 10 hours on a powerful server (quad-core Linux with 8G of RAM)! Looping over just the first 100 rows takes 12 seconds:

> system.time(for (key in row.names(dat)[1:100]) { row <- dat[key, ] })
   user  system elapsed
 12.637   0.120  12.756

But if, instead of the above, I do this:

> for (i in 1:nrow(dat)) { row <- sapply(dat, function(col) col[i]) }

then it's about 20 times faster:

> system.time(for (i in 1:100) { row <- sapply(dat, function(col) col[i]) })
   user  system elapsed
  0.576   0.096   0.673

I hope you will agree that this second form is much less natural. So I was wondering: why is the "idiomatic form" so slow? Shouldn't the idiomatic form be not only elegant and easy to read, but also efficient?

Thanks,
H.

> sessionInfo()
R version 2.5.0 Under development (unstable) (2007-01-05 r40386)
x86_64-unknown-linux-gnu

locale:
LC_CTYPE=en_US;LC_NUMERIC=C;LC_TIME=en_US;LC_COLLATE=en_US;LC_MONETARY=en_US;LC_MESSAGES=en_US;LC_PAPER=en_US;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US;LC_IDENTIFICATION=C

attached base packages:
[1] "stats"     "graphics"  "grDevices" "utils"     "datasets"  "methods"
[7] "base"
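P.S. A minimal sketch of a third approach I'm considering, assuming all the columns really are of the same (character) type so that nothing is lost in the conversion: do the one-time conversion to a matrix and extract rows from that, which should sidestep the data.frame "[" method inside the loop. I haven't timed this at scale, but e.g.:

> m <- as.matrix(dat)  # one-time cost; factor columns come back as character
> for (i in 1:nrow(m)) { row <- m[i, ] }  # row is a plain named character vector

Note that m[i, ] is a named vector (much like what the sapply() form above returns), not a one-row data frame, so the computation on each row may need adjusting.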