Your original code works far faster when the input is a matrix that when it is a data.frame. Selecting a row from a data.frame is a very slow operation, selecting a row from a matrix is quick. Modifying a row or a single element in a data.frame is even worse compared to do it on a matrix. I compared your original code:
f0 <- function (df) { for (i in seq_len(nrow(df))) { t = which(df[i, 2:24] > df[i, 25]) r = min(t) df[i, 26] = (r - 1) * 16 + 1 } df[, 26] # for now just return the computed column } to one that converts relevant parts of the data.frame df to matrices or vectors before doing the loop over rows: f0.a <- function (df) { thold <- df[, "thold"] tmp <- as.matrix(df[,2:24]) ans <- df[,26] for (i in seq_len(nrow(df))) { t = which(tmp[i,]>thold[i]) r = min(t) ans[i] = (r-1)*16+1 } # df[,26] <- ans ans } On a 10,000 row data.frame f0 took 47.950 seconds and f0.a took 0.140 seconds. (f1, below, took 0.012 seconds.) Bill Dunlap Spotfire, TIBCO Software wdunlap tibco.com > -----Original Message----- > From: William Dunlap > Sent: Monday, October 17, 2011 11:00 AM > To: Nathan Piekielek > Cc: r-help@r-project.org > Subject: RE: [R] how to use 'which' inside of 'apply'? > > Try vectorizing it a bit by looping over the columns. > E.g., > > f1 <- function (df) > { > # loop (backwards) over all columns in df whose > # names start with "D" to find the earliest one > # that is bigger than column "thold". I tested with > # df being a data.frame but a matrix should work too. > i <- rep(NA_character_, nrow(df)) > colNames <- grep(value = TRUE, "^D", colnames(df)) > for (colName in rev(colNames)) { > i[df[, colName] > df[, "thold"]] <- colName > } > # convert column name "D<number>" to <number>. > doy <- as.numeric(sub("^D", "", i)) > doy > } > > f1(a) > [1] 129 145 129 177 177 177 > > You could also try looping over rows with something like > findInterval. If there are far fewer columns than rows > then looping over columns is generally faster. > > Your sample data.frame I called 'a' and in copy-and-pastable > form (from dput()) is > a <- structure(list(pt = c(39177L, 39178L, 39164L, 39143L, 39144L, > 39146L), D1 = c(0L, 0L, 0L, 0L, 0L, 0L), D17 = c(0L, 0L, 0L, > 0L, 0L, 0L), D33 = c(0L, 0L, 0L, 0L, 0L, 0L), D49 = c(0L, 0L, > 0L, 0L, 0L, 0L), D65 = c(0L, 0L, 0L, 0L, 0L, 0L), D81 = c(0L, > 0L, 0L, 0L, 0L, 0L), D97 = c(0L, 0L, 0L, 0L, 0L, 0L), D113 = c(0L, > 0L, 0L, 0L, 0L, 0L), D129 = c(0.4336, 0.342, 0.483, 0.3088, 0.339, > 0.4232), D145 = c(0.4754, 0.4543, 0.4943, 0.3753, 0.4152, 0.4442 > ), D161 = c(0.5340667, 0.5397666, 0.5740333, 0.4466, 0.5147, > 0.5084), D177 = c(0.5927334, 0.6252333, 0.6537667, 0.5179, 0.6142, > 0.5726), D193 = c(0.6514, 0.7107, 0.7335, 0.5892, 0.7137, 0.6368 > ), D209 = c(0.6966, 0.7123, 0.6255, 0.6468, 0.6914, 0.5896), > D225 = c(0.59, 0.5591, 0.6228, 0.4794, 0.6381, 0.4703), D241 = c(0.5583, > 0.4617, 0.5255, 0.4411, 0.5704, 0.4936), D257 = c(0.5676, > 0.4206, 0.5436, 0.4307, 0.5619, 0.5353), D273 = c(0.4682, > 0.3867, 0.5541, 0.3632, 0.5347, 0.4067), D289 = c(0.35115, > 0.2578, 0.46195, 0.34355, 0.4976, 0.39685), D305 = c(0.2341, > 0.1289, 0.3698, 0.3239, 0.4605, 0.387), D321 = c(0.11705, > 0, 0.1849, 0, 0, 0), D337 = c(0L, 0L, 0L, 0L, 0L, 0L), D353 = c(0L, > 0L, 0L, 0L, 0L, 0L), thold = c(0.406825, 0.4206, 0.4592, > 0.4778, 0.52635, 0.5119), doy = c(0L, 0L, 0L, 0L, 0L, 0L)), .Names = > c("pt", > "D1", "D17", "D33", "D49", "D65", "D81", "D97", "D113", "D129", > "D145", "D161", "D177", "D193", "D209", "D225", "D241", "D257", > "D273", "D289", "D305", "D321", "D337", "D353", "thold", "doy" > ), class = "data.frame", row.names = c("1", "2", "3", "4", "5", > "6")) > > > Bill Dunlap > Spotfire, TIBCO Software > wdunlap tibco.com > > > -----Original Message----- > > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On > > Behalf Of R. Michael > > Weylandt > > Sent: Monday, October 17, 2011 10:32 AM > > To: Nathan Piekielek > > Cc: r-help@r-project.org > > Subject: Re: [R] how to use 'which' inside of 'apply'? > > > > I think something like this should do it at a huge speed up, though > > I'd advise you check it to make sure it does exactly what you want: > > there's also nothing to guarantee that something beats the threshold, > > so that might make the whole thing fall apart (though I don't think it > > will) > > > > # Sample data > > df = data.frame(x = sample(5, 15,T), > > y = sample(5, 15, T), > > z = sample(5, 15,T), > > w = (1:5)/2 + 0.5, > > th = (1:5)/2, > > doy = rep(0,15)) > > > > wd <- which(df[,1:4] > df[,5], arr.ind = TRUE) > > # identify all elements that beat the threshold value by their indices > > > > wd <- wd[!duplicated(wd[,1]),] > > # select only the first appearance of each "row" value in wd -- this > > keeps the earliest column beating the threshold > > > > wd <- wd[order(wd[,"row"]),] > > # sort them by row > > > > df$doy = (wd[,"col"]-1)*16 + 1 > > # The column transform you used. > > > > Hope this helps, > > > > Michael > > > > > > On Mon, Oct 17, 2011 at 1:03 PM, Nathan Piekielek <npiekie...@gmail.com> > > wrote: > > > Hello R-community, > > > > > > I am trying to populate a column (doy) in a large dataset with the first > > > column number that exceeds the value in another column (thold) using the > > > 'apply' function. > > > > > > Sample data: > > > pt D1 D17 D33 D49 D65 D81 D97 D113 D129 D145 D161 D177 > > > D193 D209 D225 D241 D257 > > > 1 39177 0 0 0 0 0 0 0 0 0.4336 0.4754 0.5340667 0.5927334 > > > 0.6514 0.6966 0.5900 0.5583 0.5676 > > > 2 39178 0 0 0 0 0 0 0 0 0.3420 0.4543 0.5397666 0.6252333 > > > 0.7107 0.7123 0.5591 0.4617 0.4206 > > > 3 39164 0 0 0 0 0 0 0 0 0.4830 0.4943 0.5740333 0.6537667 > > > 0.7335 0.6255 0.6228 0.5255 0.5436 > > > 4 39143 0 0 0 0 0 0 0 0 0.3088 0.3753 0.4466000 0.5179000 > > > 0.5892 0.6468 0.4794 0.4411 0.4307 > > > 5 39144 0 0 0 0 0 0 0 0 0.3390 0.4152 0.5147000 0.6142000 > > > 0.7137 0.6914 0.6381 0.5704 0.5619 > > > 6 39146 0 0 0 0 0 0 0 0 0.4232 0.4442 0.5084000 0.5726000 > > > 0.6368 0.5896 0.4703 0.4936 0.5353 > > > D273 D289 D305 D321 D337 D353 thold doy > > > 1 0.4682 0.35115 0.2341 0.11705 0 0 0.406825 0 > > > 2 0.3867 0.25780 0.1289 0.00000 0 0 0.420600 0 > > > 3 0.5541 0.46195 0.3698 0.18490 0 0 0.459200 0 > > > 4 0.3632 0.34355 0.3239 0.00000 0 0 0.477800 0 > > > 5 0.5347 0.49760 0.4605 0.00000 0 0 0.526350 0 > > > 6 0.4067 0.39685 0.3870 0.00000 0 0 0.511900 0 > > > > > > For the first record in above example I would expect doy = 129. > > > > > > I can achieve this with the following loop, but it takes several days to > > > run > > > and there must be a more efficient solution: > > > > > > for (i in (1:152000)) { > > > t=which(data[i,2:24]>data[i,25]) > > > r=min(t) > > > data[i,26]=(r-1)*16+1 > > > } > > > > > > How do I write this using 'apply' or another function that will be more > > > efficient? > > > > > > I have tried the following: > > > data$doy=apply(which(data[,2:24]>data[,25]),1,min) > > > > > > Which returns the following error message: > > > "Error in apply(which(new[, 2:24] > new[, 25]), 1, min) : > > > dim(X) must have a positive length" > > > > > > Any help would be much appreciated. > > > > > > Nathan > > > > > > ______________________________________________ > > > R-help@r-project.org mailing list > > > https://stat.ethz.ch/mailman/listinfo/r-help > > > PLEASE do read the posting guide > > > http://www.R-project.org/posting-guide.html > > > and provide commented, minimal, self-contained, reproducible code. > > > > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.