Hi:

Your question about efficiency does not seem well-posed to me. Efficient relative to what criterion? Rather than address your question directly, I'll show how several situations that could arise in the general context of your problem can be handled.

One of the first rules in R programming is to learn the concepts of vectorization and indexing. This saves a lot of code down the line. R is not C(++) or Java, and it shouldn't be programmed as though it were. As a result, iterative approaches to problem solving in R are usually, but not always, inefficient. R has many vectorized functions which should be used whenever possible. Usually, the apply family of functions or one of the summarization packages (notably data.table, doBy and plyr, although there are others) can be exploited to apply a function repeatedly to different subsets of data.

Consider three different situations below in which one might want to apply a t-test. Only one uses iteration. I'm using the plyr package because it is the most flexible in terms of the types of input and output objects it can process. Let's start by manufacturing some matrix data:

library(plyr)    # provides adply() and mdply()

## function to generate a matrix
mgen <- function() matrix(rnorm(50), nrow = 10)

## use replicate() to generate an array
marr <- replicate(4, mgen())     # a 10 x 5 x 4 array
marr

# A matrix of column indices to use in t.test()
tcols <- matrix(c(1, 2, 1, 3, 1, 4, 1, 5), ncol = 2, byrow = TRUE)
colnames(tcols) <- c('i', 'j')
tcols

# ------------------------
# Situation 1: multiple matrices, test the same pair
# of columns in each, in this case 2 and 4.

# The input argument m is a matrix. A data frame is
# returned because that's what the adply() function in
# the plyr package expects as output (a = array input,
# d = data frame output)
tfun1 <- function(m) {
    v <- t.test(m[, 2], m[, 4], var.equal = TRUE)
    data.frame(tstat = v$statistic, pval = v$p.value)
}

# adply takes the input array marr, iterates over the third index
# and applies tfun1 to each marginal matrix
res1 <- adply(marr, 3, tfun1)
res1

# ------------------------
# Situation 2: one matrix, test multiple pairs of columns

mat <- mgen()    # generate a single matrix
tfun2 <- function(i, j) {
    v <- t.test(mat[, i], mat[, j], var.equal = TRUE)
    data.frame(tstat = v$statistic, pval = v$p.value)
}

# mdply() takes the matrix of column indices as its first
# argument. Notice that tfun2 was written so that its
# arguments are i and j, the column names of tcols.
# This is required, and the order matters. For each
# row of tcols, the function tfun2 is applied to the
# matrix mat.
res2 <- mdply(tcols, tfun2)
res2

# -------------------
# Situation 3: n matrices, different pairs of columns
# tested in each

# The idea is to perform a t-test on different pairs of
# columns in each submatrix of marr.
# The simplest thing to do in this situation is to
# iterate, although there is probably some clever way to
# do this using nested apply family calls. The reason for
# iteration is that we want to operate on the same
# relevant index of *both* marr and tcols. It's possible to
# use mapply() for this task, but that would take more
# explanation and this is long-winded enough.
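#
# For the curious, here is a rough sketch of that mapply() route. It is
# only an illustration: slice_ttests() and one_test() are names made up
# for this sketch, not part of plyr or base R. mapply() steps through
# the slice index k and the two column indices in parallel, so the k-th
# slice of the array is paired with the k-th row of the index matrix.
slice_ttests <- function(arr, cols) {
    one_test <- function(k, i, j) {
        v <- t.test(arr[, i, k], arr[, j, k], var.equal = TRUE)
        c(col1 = i, col2 = j, tstat = unname(v$statistic), pval = v$p.value)
    }
    # mapply() returns one column per test; transpose to one row per test
    t(mapply(one_test, seq_len(nrow(cols)), cols[, 1], cols[, 2]))
}
# With the objects defined above, the call would be:
#   res3 <- slice_ttests(marr, tcols)
# which should give the same values as the loop below.
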

outmat <- matrix(NA, nrow = nrow(tcols), ncol = 4)
for(k in seq_len(nrow(tcols))) {
    mat <- marr[, , k]     # take k-th submatrix of marr
    cols <- tcols[k, ]     # take k-th row of tcols
    v <- t.test(mat[, cols[1]], mat[, cols[2]], var.equal = TRUE)
    outmat[k, ] <- c(cols[1], cols[2], v$statistic, v$p.value)
}
colnames(outmat) <- c('col1', 'col2', 'tstat', 'pval')
outmat

Notice that the type of input matters, so the way in which the data are arranged has much to do with the way you program in R, especially with the apply family of functions and their offshoots in different packages. The basic programming strategy is to write a utility function that works for a generic subset of the input data, and then use one of the **ply() functions or functions in the apply family to map the function to different data subsets.

HTH,
Dennis

On Thu, Aug 4, 2011 at 8:19 PM, Matt Curcio <matt.curcio...@gmail.com> wrote:
> Greetings all,
> I am curious to know if either of these two sets of code is more efficient?
>
> Example1:
> ## t-test ##
> colA <- temp [ , j ]
> colB <- temp [ , k ]
> ttr <- t.test ( colA, colB, var.equal=TRUE)
> tt_pvalue [ i ] <- ttr$p.value
>
> or
> Example2:
> tt_pvalue [ i ] <- t.test ( temp[ , j ], temp[ , k ], var.equal=TRUE)
> -------------
> I have three loops, i, j, k.
> One to test the all of <i> files in a directory. One to tease out
> column <j> and compare it by means of t-test to column <k> in each of
> the files.
> ---------------
> for ( i in 1:num_files ) {
>    temp <- read.table ( files_to_test [ i ], header=TRUE, sep="\t")
>    num_cols <- ncol ( temp )
>    ## Define Columns To Compare ##
>    for ( j in 2 : num_cols ) {
>       for ( k in 3 : num_cols ) {
>          ## t-test ##
>          colA <- temp [ , j ]
>          colB <- temp [ , k ]
>          ttr <- t.test ( colA, colB, var.equal=TRUE)
>          tt_pvalue [ i ] <- ttr$p.value
>       }
>    }
> }
> --------------------------------
> I am a novice writer of code and am interested to hear if there are
> any (dis)advantages to one way or the other.
> M
>
>
> Matt Curcio
> M: 401-316-5358
> E: matt.curcio...@gmail.com
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>