Re: [R] Efficiency Question - Nested lapply or nested for loop

David Winsemius Fri, 08 Oct 2010 09:29:28 -0700

You are losing a lot of time by repeatedly calculating character indices with paste() in every iteration. Two options:

-- 1) calculate these once outside the loop and then refer to them by index


idx.names <- vector(mode="list", length=nind)   # one element per individual
for (i in (0:(nind-1))) {idx.names[[i+1]] <-    # need the offset
       c(paste("G_hat_0_",i,sep=""),
        paste("G_hat_1_",i,sep=""),
        paste("G_hat_2_",i,sep=""),
        paste("G_",i,sep="") ) }

Then the inner loop would be:
for (i in (0:(nind-1))) {
      nms = idx.names[[i+1]]   # the four column names for individual i
      Gmax = which.max(c(data[[ nms[1] ]][row],
                         data[[ nms[2] ]][row],
                         data[[ nms[3] ]][row] ))

        Gtru = data[[ nms[4] ]][row] + 1  # add 1 to match Gmax range
        cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
                       }

And as has been said many times before,...
require(fortunes)
fortune("dog")

-- 2) probably even faster to pre-calculate (or just construct by inspection) those column indices as a numeric vector and then access with data[row, numidxs[i] ]
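
A minimal sketch of what option 2 might look like, using the toy data from the original post. The names hat.idx and tru.idx are made up for illustration (they play the role of numidxs above), and match() is only one way to get the column positions. No timings are claimed here; on a data frame, converting the numeric columns with as.matrix() first may help further.

```r
# Toy data copied from the original post (six loci, two individuals).
data <- data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
        G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
        G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
        G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
        G_0=c(1,1,1,2,2,2),
        G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
        G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
        G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
        G_1=c(1,2,0,0,1,1))
nind <- length(grep("^G_[0-9]+$", names(data)))

# Look up the column positions once, outside any loop.
# hat.idx is 3 x nind (one column per individual); tru.idx has length nind.
ids <- 0:(nind-1)
hat.idx <- rbind(match(paste("G_hat_0_", ids, sep=""), names(data)),
                 match(paste("G_hat_1_", ids, sep=""), names(data)),
                 match(paste("G_hat_2_", ids, sep=""), names(data)))
tru.idx <- match(paste("G_", ids, sep=""), names(data))

cmat <- matrix(0, nrow=3, ncol=3,
               dimnames=list(c("call_rr","call_rv","call_vv"),
                             c("tru_rr","tru_rv","tru_vv")))

# Same nested loop as before, but with numeric indexing and no paste().
for (row in 1:nrow(data)) {
  for (i in 1:nind) {
    Gmax <- which.max(unlist(data[row, hat.idx[, i]]))
    Gtru <- data[row, tru.idx[i]] + 1   # add 1 to match Gmax range
    cmat[Gmax, Gtru] <- cmat[Gmax, Gtru] + 1
  }
}
cmat
```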

The for-loop is generally going to be faster than an lapply solution. The fastest solution would be a fully indexed strategy, which might become more apparent (it's not yet so to me) after you implement the second option above.
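
One possible shape for such an indexed strategy, offered only as a sketch and not as the method intended above: keep a loop over individuals, but handle all rows for an individual at once with max.col() and tabulate(). It assumes every G_X value is 0, 1 or 2 with no NAs. ties.method="first" is used so that ties resolve the same way which.max() does.

```r
# Toy data copied from the original post (six loci, two individuals).
data <- data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
        G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
        G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
        G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
        G_0=c(1,1,1,2,2,2),
        G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
        G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
        G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
        G_1=c(1,2,0,0,1,1))
nind <- length(grep("^G_[0-9]+$", names(data)))

cmat <- matrix(0, nrow=3, ncol=3,
               dimnames=list(c("call_rr","call_rv","call_vv"),
                             c("tru_rr","tru_rv","tru_vv")))

for (i in 0:(nind-1)) {
  # n x 3 matrix of the three genotype probabilities for individual i
  hat <- as.matrix(data[paste("G_hat_", 0:2, "_", i, sep="")])
  Gmax <- max.col(hat, ties.method="first")    # called genotype, 1..3, per row
  Gtru <- data[[paste("G_", i, sep="")]] + 1   # true genotype, 1..3, per row
  # Column-major linear index into the 3 x 3 matrix, counted in one pass.
  cmat <- cmat + tabulate(Gmax + 3*(Gtru - 1), nbins=9)
}
cmat
```

With 10000 rows and up to 500 individuals this replaces millions of scalar lookups with at most 500 vectorized passes, which is where any gain would come from.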


--
David.

On Oct 8, 2010, at 11:35 AM, epowell wrote:

My data looks like this:
data
  name G_hat_0_0 G_hat_1_0 G_hat_2_0 G_0 G_hat_0_1 G_hat_1_1 G_hat_2_1 G_1
1  rs0  0.488000  0.448625  0.063375   1  0.480875  0.454500  0.064625   1
2  rs1  0.002375  0.955375  0.042250   1  0.000000  0.062875  0.937125   2
3  rs2  0.050375  0.835875  0.113750   1  0.877250  0.115875  0.006875   0
4  rs3  0.000000  0.074750  0.925250   2  0.897750  0.102000  0.000250   0
5  rs4  0.000125  0.052375  0.947500   2  0.261500  0.724125  0.014375   1
6  rs5  0.003750  0.092125  0.904125   2  0.023000  0.738125  0.238875   1
And my task is:
For each individual (X) on each row, to find the index corresponding to the
max of G_hat_0_X, G_hat_1_X, G_hat_2_X and then increment the cell of the
confusion matrix with the row corresponding to that index and the column
corresponding to G_X.
For example, in the first row and the first individual, the index with the
max value (0.488000) is 0 and the G_0 value is 1, so I would increment the
matrix cell in the first row and second column. (Note that the ranges
between rows and columns are off by one. That is accounted for in the code.)
In reality the data will be much bigger, containing 10000 rows and a
variable number of columns (inds) between 10 and 500.

The correct result is:
cmat
       tru_rr tru_rv tru_vv
call_rr      2      2      0
call_rv      0      4      0
call_vv      0      0      4
I am not sure what the best way to do this is. I implemented it once using
two for loops. Then I tried to use lapply and came up with a nested lapply
solution, but it was slower than the simple loops. I still think that there
is a better way and I was hoping for some advice. Perhaps something with
pmax....

#### DATA PREP ##########

data = data.frame(name=c("rs0","rs1","rs2","rs3","rs4","rs5"),
        G_hat_0_0=c(0.488,0.002375,0.050375,0,0.000125,0.00375),
        G_hat_1_0=c(0.448625,0.955375,0.835875,0.07475,0.052375,0.092125),
        G_hat_2_0=c(0.063375,0.04225,0.11375,0.92525,0.9475,0.904125),
        G_0=c(1,1,1,2,2,2),
        G_hat_0_1=c(0.480875,0,0.87725,0.89775,0.2615,0.023),
        G_hat_1_1=c(0.4545,0.062875,0.115875,0.102,0.724125,0.738125),
        G_hat_2_1=c(0.064625,0.937125,0.006875,0.00025,0.014375,0.238875),
        G_1=c(1,2,0,0,1,1))     

# get list of inds in file (e.g. G_0,G_1,...,G_100)
inds = grep("G_[0-9]+",names(data),perl=T,value=T)

# get total number of inds
nind = length(inds)

# create an empty "confusion" table
cmat = matrix(rep(0,9), nrow=3, ncol=3)
colnames(cmat) = c("tru_rr", "tru_rv", "tru_vv")
rownames(cmat) = c("call_rr","call_rv","call_vv")

## APPROACH 1: Nested For Loop ####

# Nested Loop Approach
for (row in (1:nrow(data))) {
for (i in (0:(nind-1))) {

        Gmax = which.max(c( data[[paste("G_hat_0_",i,sep="")]][row],
                                  data[[paste("G_hat_1_",i,sep="")]][row],
                                  data[[paste("G_hat_2_",i,sep="")]][row] ))
        Gtru = data[[paste("G_",i,sep="")]][row] + 1 # add 1 to match Gmax range
        cmat[Gmax,Gtru] = cmat[Gmax,Gtru] + 1
}
}


## APPROACH 2: Nested lapply ####

# This routine finds the geno w/ highest prob from the erg.avgs.
# and compares it to the true geno. Result is tallied by                
# incrementing the appropriate index of the confusion matrix    

add2cmat <- function(ind,locus) {

        Gmax = which.max(c( data[[paste("G_hat_0_",ind,sep="")]][locus],
                                  data[[paste("G_hat_1_",ind,sep="")]][locus],
                                  data[[paste("G_hat_2_",ind,sep="")]][locus] ))
        # add 1 to match Gmax range
        Gtru = data[[paste("G_",ind,sep="")]][locus] + 1
        # use double arrow to modify global env.
        cmat[Gmax,Gtru] <<- cmat[Gmax,Gtru] + 1

}

# Run add2cmat for all individuals on a given locus

add_locus2cmat <- function(locus) {
        lapply(0:(nind-1),add2cmat,locus)
}
junk = lapply((1:nrow(data)),add_locus2cmat) # don't need return value
--
View this message in context: 
http://r.789695.n4.nabble.com/Efficiency-Question-Nested-lapply-or-nested-for-loop-tp2968553p2968553.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


David Winsemius, MD
West Hartford, CT
