Below is the script that was based on your help - it works very fast: ### Creating the example data set: set.seed(123) MyData<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0, sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2))) MyData
### Specifying parameters used in the code below: vars<-names(MyData)[2:3] # names of variables to be transformed nr.vars<-length(vars) # number of variables to be transformed group.var<-names(MyData)[1] # name of the grouping variable ### For EACH subgroup: indexing variables a and b to their maximum in that subgroup; ### These indexed variables will be used to build the new ones: system.time({ temp <- cbind(MyData, do.call(cbind, lapply(vars, function(x){ #x<-"b" unlist(by(MyData, MyData[[group.var]], function(y) y[,x] / max(y[,x]))) }))) colnames(temp)[(length(MyData)+1):(length(MyData)+nr.vars)] <- paste(vars, 'IndToMax', sep = '.') }) # Grabbing names of the newly created variables that end with "IndToMax" indexed.vars<-names(temp)[grep("IndToMax$", names(temp))] # variables indexed to subgroup max # Specifying parameters used for transformation below: old.length<-length(temp) hl<-c(.3,.6,1:5) hrf<-seq(.15,.90,.15) ### Actual Transformation: library(fortunes) # will use function Reduce from the package "fortunes" system.time({ constants <- expand.grid(vars = indexed.vars, HL = hl, HRF = hrf) results <- lapply(seq(nrow(constants)), function(x){ dat <- temp[, as.character(constants[x, 1])] D <- exp(log(0.5) / constants[x, 2]) L <- -10 * log(1 - constants[x, 3]) unlist(by(dat, temp[[group.var]], function(y) # function Reduce Reduce(function(u, v) 1 - ((1 - u * D) / (exp(v * L))), y, accumulate = T, init = 0)[-1])) }) final <- cbind(temp, do.call(cbind, results)) colnames(final)[-(1:old.length)] <- paste(vars, constants$HL, 100*constants$HRF, '.transformed', sep = '.') }) Thanks again for all your help! Dimitri On Mon, Mar 29, 2010 at 4:16 PM, Dimitri Liakhovitski <ld7...@gmail.com> wrote: > Would like to thank every one once more for your great help. > I was able to reduce the time from god knows how many hours to about 2 > minutes! > Really appreciate it! > Dimitri > > On Sat, Mar 27, 2010 at 11:43 AM, Martin Morgan <mtmor...@fhcrc.org> wrote: >> On 03/26/2010 06:40 PM, Dimitri Liakhovitski wrote: >>> My sincere apologies if it looked large. Let me try again with less code. >>> It's hard to do less than that. In fact - there is nothing in this >>> code but 1 formula and many loops, which is the problem I am not sure >>> how to solve. >>> I also tried to be as clear as possible with the comments. >>> Dimitri >>> >>> ## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE >>> set.seed(123) >>> >>> data<-data.frame(group=c(rep("first",10),rep("second",10)),a=abs(round(rnorm(20,mean=0, >>> sd=.55),2)), b=abs(round(rnorm(20,mean=0, sd=.55),2))) >>> data # "data" it is the data frame to work with >>> ## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data" >>> would contain up to 150-200 rows PER SUBGROUP >>> >>> ### Specifying useful parameters used in the slow code below: >>> vars<-names(data)[2:3] # names of variables used in >>> transformation; in real life - up to 50-60 variables >>> group.var<-names(data)[1] # name of the grouping variable >>> subgroups<-levels(data[[group.var]]) # names of subgroups; in real >>> life - up to 30 subgroups >>> >>> # OBJECTIVE: >>> # Need to create new variables based on the old ones (a & b) >>> # For each new variable, the value in a given row is a function of (a) >>> 2 constants (that have several levels each), >>> # (b) value of the original variable (e.g., a.ind.to.max"), and the >>> value in the previous row on the same new variable >>> # Plus - it has to be done by subgroup (variable "group") >>> >>> # Defining 2 constants: >>> constant1<-c(1:3) # constant 1 used in transformation - >>> has 3 levels, in real life - up to 7 levels >>> constant2<-seq(.15,.45,.15) # constant 2 used in transformation - has >>> 3 levels, in real life - up to 7 levels >>> >>> ### CODE THAT IS SLOW. Reason - too many loops with the inner-most >>> loop being very slow - as it is looping through rows: >>> >>> for(var in vars){ # looping through variables >>> for(c1 in 1:length(constant1)){ # looping through values of >>> constant1 >>> for(c2 in 1:length(constant2)){ # looping through values of >>> constant2 >>> d=log(0.5)/constant1[c1] >>> l=-log(1-constant2[c2]) >>> name<-paste(var,constant1[c1],constant2[c2]*100,".transf",sep=".") >>> data[[name]]<-NA >>> for(subgroup in subgroups){ # looping through subgroups >>> data[data[[group.var]] %in% subgroup, name][1] = >>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, >>> var][1]*l*10))) >>> >>> ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS: >>> for(case in 2:nrow(data[data[[group.var]] %in% subgroup, >>> ])){ # looping through rows >>> data[data[[group.var]] %in% subgroup, name][case]= >>> 1-((1-data[data[[group.var]] %in% subgroup, >>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, >>> var][case]*l*10))) >>> } >>> ### END OF THE SLOWEST SECTION (INNERMOST LOOP) >> >> This is a good place to start! Look at the inner loop >> >> I notice there are a lot of expressions whose values don't change, but >> are calculated repeatedly, e.g., exp(1) and data[[group.var]] %in% >> subgroup. Let's move those common subexpressions out of the loop, e.g., >> >> exp1 <- exp(1) >> idx <- data[[group.var]] %in% subgroup >> >> for(case in 2:nrow(data[idx,])) { >> data[idx, name][case]= >> 1-((1-data[idx, name][case-1]*exp1^d) / >> (exp1^(data[idx, var][case]*l*10))) >> } >> >> wow that's already looking better! Looks like data[idx,var] could also >> come out of the loop, and data[idx, name], too, so long as we assign >> back in at the end >> >> exp1 <- exp(1) >> idx <- data[[group.var]] %in% subgroup >> a <- data[idx, var] >> x <- data[idx, name] >> >> for(case in 2:length(x)) { >> x[case] <- 1 - (1 - x[case-1]*exp1^d) / >> (exp1^(a[case]*l*10)) >> } >> >> data[idx, name] <- x >> >> Still some redundant calculations, e.g., exp1^d, l*10. Some of the >> things we've moved out of the innermost loop can be moved even further >> out (e.g., exp1 all the way to the top). And a calculation like >> >> exp1^(a[case] * l * 10) >> >> would be better written out of the loop as >> >> b <- exp1^(a * l * 10) >> >> Why not try re-implementing your code with these ideas in mind? It'll be >> important to make sure that the results of your original and new >> implementation are the same (using all.equal, for instance). The 'CODE >> THAT IS SLOW' part of your original code took 0.35s to execute on my >> machine (using system.time), whereas my revised code with changes like >> that above, and a couple of other small things, ran in 0.01s. >> >> What's your revised code look like now? >> >> Martin >> >>> } >>> } >>> } >>> } >>> >>> ### END OF THE CODE >>> >>> On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.ber...@gene.com> wrote: >>>> Dmitri: >>>> >>>> If you follow the R posting guide you're more likely to get useful replies. >>>> In particular it asks for **small** reproducible examples -- your example >>>> is >>>> far more code then I care to spend time on anyway (others may be more >>>> willing or more able to do so of course). I suggest you try (if you haven't >>>> already): >>>> >>>> 1. Profiling the code using Rprof to isolate where the time is spent.And >>>> then... >>>> >>>> 2. Writing a **small** reproducible example to exercise that portion of the >>>> code and post it with your question to the list. If you need to... >>>> Typically, if you do these things you'll figure out how to fix the >>>> situation on your own. >>>> >>>> Cheers, >>>> >>>> Bert Gunter >>>> Genentech Nonclinical Statistics >>>> >>>> -----Original Message----- >>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On >>>> Behalf Of Dimitri Liakhovitski >>>> Sent: Friday, March 26, 2010 2:06 PM >>>> To: r-help >>>> Subject: [R] Competing with SPSS and SAS: improving code that loops >>>> throughrows (data manipulation) >>>> >>>> Dear R-ers, >>>> >>>> In my question there are no statistics involved - it's all about data >>>> manipulation in R. >>>> I am trying to write a code that should replace what's currently being >>>> done in SAS and SPSS. Or, at least, I am trying to show to my >>>> colleagues R is not much worse than SAS/SPSS for the task at hand. >>>> I've written a code that works but it's too slow. Probably because >>>> it's looping through a lot of things. But I am not seeing how to >>>> improve it. I've already written a different code but it's 5 times >>>> slower than this one. The code below takes me slightly above 5 sec for >>>> the tiny data set. I've tried using it with a real one - was not done >>>> after hours. >>>> Need help of the list! Maybe someone will have an idea on how to >>>> increase the efficiency of my code (just one block of it - in the >>>> "DATA TRANSFORMATION" Section below)? >>>> >>>> Below - I am creating the data set whose structure is similar to the >>>> data sets the code should be applied to. Also - I have desribed what's >>>> actually being done - in comments. >>>> Thanks a lot to anyone for any suggestion! >>>> >>>> Dimitri >>>> >>>> ###### CREATING THE TEST DATA SET ################################ >>>> >>>> set.seed(123) >>>> data<-data.frame(group=c(rep("first",10),rep("second",10)),week=c(1:10,1:10) >>>> ,a=abs(round(rnorm(20)*10,0)), >>>> b=abs(round(rnorm(20)*100,0))) >>>> data >>>> dim(data)[1] # !!! In real life I might have up to 150 (!) rows >>>> (weeks) within each subgroup >>>> >>>> ### Specifying parameters used in the code below: >>>> vars<-names(data)[3:4] # names of variables to be transformed >>>> nr.vars<-length(vars) # number of variables to be transformed; !!! >>>> in real life I'll have to deal with up to 50-60 variables, not 2. >>>> group.var<-names(data)[1] # name of the grouping variable >>>> subgroups<-levels(data[[group.var]]) # names of subgroups; !!! in >>>> real life I'll have up to 20-25 subgroups, not 2. >>>> >>>> # For EACH subgroup: indexing variables a and b to their maximum in >>>> that subgroup; >>>> # Further, I'll have to use these indexed variables to build the new ones: >>>> for(i in vars){ >>>> new.name<-paste(i,".ind.to.max",sep="") >>>> data[[new.name]]<-NA >>>> } >>>> >>>> indexed.vars<-names(data)[grep("ind.to.max$", names(data))] # >>>> variables indexed to subgroup max >>>> for(subgroup in subgroups){ >>>> data[data[[group.var]] %in% >>>> subgroup,indexed.vars]<-lapply(data[data[[group.var]] %in% >>>> subgroup,vars],function(x){ >>>> y<-x/max(x) >>>> return(y) >>>> }) >>>> } >>>> data >>>> >>>> ############# DATA TRANSFORMATION ######################################### >>>> >>>> # Objective: Create new variables based on the old ones (a and b >>>> ind.to.max) >>>> # For each new variable, the value in a given row is a function of (a) >>>> 2 constants (that have several levels each), >>>> # (b) the corresponding value of the original variable (e.g., >>>> a.ind.to.max"), and the value in the previous row on the same new >>>> variable >>>> # PLUS: - it has to be done by subgroup (variable "group") >>>> >>>> constant1<-c(1:3) # constant 1 used for transformation - >>>> has 3 levels; !!! in real life it will have up to 7 levels >>>> constant2<-seq(.15,.45,.15) # constant 2 used for transformation - >>>> has 3 levels; !!! in real life it will have up to 7 levels >>>> >>>> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous >>>> code section): >>>> start1<-Sys.time() >>>> for(var in indexed.vars){ # looping through variables >>>> for(c1 in 1:length(constant1)){ # looping through levels of constant1 >>>> for(c2 in 1:length(constant2)){ # looping through levels of >>>> constant2 >>>> d=log(0.5)/constant1[c1] >>>> l=-log(1-constant2[c2]) >>>> >>>> name<-paste(strsplit(var,".ind.to.max"),constant1[c1],constant2[c2]*100,"..t >>>> ransf",sep=".") >>>> data[[name]]<-NA >>>> for(subgroup in subgroups){ # looping through subgroups >>>> data[data[[group.var]] %in% subgroup, name][1] = >>>> 1-((1-0*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, >>>> var][1]*l*10))) # this is just the very first row of each subgroup >>>> for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){ >>>> # looping through the remaining rows of the subgroup >>>> data[data[[group.var]] %in% subgroup, name][case]= >>>> 1-((1-data[data[[group.var]] %in% subgroup, >>>> name][case-1]*exp(1)^d)/(exp(1)^(data[data[[group.var]] %in% subgroup, >>>> var][case]*l*10))) >>>> } >>>> } >>>> } >>>> } >>>> } >>>> end1<-Sys.time() >>>> print(end1-start1) # Takes me ~0.53 secs >>>> names(data) >>>> data >>>> >>>> -- >>>> Dimitri Liakhovitski >>>> Ninah.com >>>> dimitri.liakhovit...@ninah.com >>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>>> >>> >>> >>> >> >> >> -- >> Martin Morgan >> Computational Biology / Fred Hutchinson Cancer Research Center >> 1100 Fairview Ave. N. >> PO Box 19024 Seattle, WA 98109 >> >> Location: Arnold Building M1 B861 >> Phone: (206) 667-2793 >> > > > > -- > Dimitri Liakhovitski > Ninah.com > dimitri.liakhovit...@ninah.com > -- Dimitri Liakhovitski Ninah.com dimitri.liakhovit...@ninah.com ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.