I would like to thank everyone once more for your great help. I was able to reduce the run time from god knows how many hours to about 2 minutes! Really appreciate it!

Dimitri
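A note for readers of the archive: Dimitri's final code was not posted to the list. Purely as an illustration of the kind of rewrite Martin sketches below -- hoist the loop-invariant expressions, vectorise the denominator, and keep only the unavoidable row-by-row recurrence in a loop -- here is a self-contained sketch on the small example data from this thread. It is not Dimitri's actual solution, and names such as 'ed' and 'denom' are illustrative:

set.seed(123)
data <- data.frame(group = c(rep("first", 10), rep("second", 10)),
                   a = abs(round(rnorm(20, mean = 0, sd = .55), 2)),
                   b = abs(round(rnorm(20, mean = 0, sd = .55), 2)))

vars      <- names(data)[2:3]
group.var <- names(data)[1]
subgroups <- unique(data[[group.var]])   # unique() also works if 'group' is not a factor
constant1 <- c(1:3)
constant2 <- seq(.15, .45, .15)

for (var in vars) {
  for (c1 in seq_along(constant1)) {
    for (c2 in seq_along(constant2)) {
      d    <- log(0.5) / constant1[c1]
      ed   <- exp(d)                               # invariant: computed once per (c1, c2)
      l    <- -log(1 - constant2[c2])
      name <- paste(var, constant1[c1], constant2[c2] * 100, ".transf", sep = ".")
      data[[name]] <- NA_real_
      for (subgroup in subgroups) {
        idx   <- data[[group.var]] %in% subgroup   # logical row index, computed once
        denom <- exp(data[idx, var] * l * 10)      # whole denominator, vectorised
        x     <- numeric(length(denom))
        x[1]  <- 1 - 1 / denom[1]                  # first row: the 'previous' value is 0
        for (case in 2:length(x)) {                # only the recurrence itself stays a loop
          x[case] <- 1 - (1 - x[case - 1] * ed) / denom[case]
        }
        data[idx, name] <- x                       # assign back to the data frame once
      }
    }
  }
}

The loop that remains is cheap because it touches only a plain numeric vector; all of the data-frame subsetting happens once per subgroup. (Like the original, it assumes at least two rows per subgroup.)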
On Sat, Mar 27, 2010 at 11:43 AM, Martin Morgan <mtmor...@fhcrc.org> wrote:
> On 03/26/2010 06:40 PM, Dimitri Liakhovitski wrote:
>> My sincere apologies if it looked large. Let me try again with less code.
>> It's hard to do less than that. In fact - there is nothing in this
>> code but 1 formula and many loops, which is the problem I am not sure
>> how to solve.
>> I also tried to be as clear as possible with the comments.
>> Dimitri
>>
>> ## START OF THE CODE TO PRODUCE SMALL DATA EXAMPLE
>> set.seed(123)
>>
>> data <- data.frame(group = c(rep("first", 10), rep("second", 10)),
>>                    a = abs(round(rnorm(20, mean = 0, sd = .55), 2)),
>>                    b = abs(round(rnorm(20, mean = 0, sd = .55), 2)))
>> data   # "data" is the data frame to work with
>> ## END OF THE CODE TO PRODUCE SMALL DATA EXAMPLE. In real life "data"
>> ## would contain up to 150-200 rows PER SUBGROUP
>>
>> ### Specifying useful parameters used in the slow code below:
>> vars <- names(data)[2:3]       # names of variables used in transformation;
>>                                # in real life - up to 50-60 variables
>> group.var <- names(data)[1]    # name of the grouping variable
>> subgroups <- levels(data[[group.var]])   # names of subgroups;
>>                                          # in real life - up to 30 subgroups
>>
>> # OBJECTIVE:
>> # Need to create new variables based on the old ones (a & b).
>> # For each new variable, the value in a given row is a function of
>> # (a) 2 constants (that have several levels each),
>> # (b) value of the original variable (e.g., "a.ind.to.max"), and the
>> #     value in the previous row on the same new variable.
>> # Plus - it has to be done by subgroup (variable "group").
>>
>> # Defining 2 constants:
>> constant1 <- c(1:3)              # constant 1 used in transformation -
>>                                  # has 3 levels, in real life - up to 7 levels
>> constant2 <- seq(.15, .45, .15)  # constant 2 used in transformation -
>>                                  # has 3 levels, in real life - up to 7 levels
>>
>> ### CODE THAT IS SLOW. Reason - too many loops with the inner-most
>> ### loop being very slow - as it is looping through rows:
>>
>> for(var in vars){                     # looping through variables
>>   for(c1 in 1:length(constant1)){     # looping through values of constant1
>>     for(c2 in 1:length(constant2)){   # looping through values of constant2
>>       d = log(0.5)/constant1[c1]
>>       l = -log(1-constant2[c2])
>>       name <- paste(var, constant1[c1], constant2[c2]*100, ".transf", sep=".")
>>       data[[name]] <- NA
>>       for(subgroup in subgroups){     # looping through subgroups
>>         data[data[[group.var]] %in% subgroup, name][1] =
>>           1-((1-0*exp(1)^d)/
>>              (exp(1)^(data[data[[group.var]] %in% subgroup, var][1]*l*10)))
>>
>>         ### THIS SECTION IS THE SLOWEST - BECAUSE I AM LOOPING THROUGH ROWS:
>>         for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){   # looping through rows
>>           data[data[[group.var]] %in% subgroup, name][case] =
>>             1-((1-data[data[[group.var]] %in% subgroup, name][case-1]*exp(1)^d)/
>>                (exp(1)^(data[data[[group.var]] %in% subgroup, var][case]*l*10)))
>>         }
>>         ### END OF THE SLOWEST SECTION (INNERMOST LOOP)
>
> This is a good place to start! Look at the inner loop.
>
> I notice there are a lot of expressions whose values don't change, but
> are calculated repeatedly, e.g., exp(1) and data[[group.var]] %in%
> subgroup. Let's move those common subexpressions out of the loop, e.g.,
>
>   exp1 <- exp(1)
>   idx  <- data[[group.var]] %in% subgroup
>
>   for(case in 2:nrow(data[idx, ])) {
>     data[idx, name][case] =
>       1 - ((1 - data[idx, name][case-1]*exp1^d) /
>            (exp1^(data[idx, var][case]*l*10)))
>   }
>
> wow that's already looking better!
>
> Looks like data[idx, var] could also come out of the loop, and
> data[idx, name], too, so long as we assign back in at the end
>
>   exp1 <- exp(1)
>   idx  <- data[[group.var]] %in% subgroup
>   a    <- data[idx, var]
>   x    <- data[idx, name]
>
>   for(case in 2:length(x)) {
>     x[case] <- 1 - (1 - x[case-1]*exp1^d) /
>                    (exp1^(a[case]*l*10))
>   }
>
>   data[idx, name] <- x
>
> Still some redundant calculations, e.g., exp1^d, l*10. Some of the
> things we've moved out of the innermost loop can be moved even further
> out (e.g., exp1 all the way to the top). And a calculation like
>
>   exp1^(a[case] * l * 10)
>
> would be better written out of the loop as
>
>   b <- exp1^(a * l * 10)
>
> Why not try re-implementing your code with these ideas in mind? It'll be
> important to make sure that the results of your original and new
> implementation are the same (using all.equal, for instance). The 'CODE
> THAT IS SLOW' part of your original code took 0.35s to execute on my
> machine (using system.time), whereas my revised code with changes like
> that above, and a couple of other small things, ran in 0.01s.
>
> What's your revised code look like now?
>
> Martin
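As an aside for archive readers, the two checks Martin recommends -- all.equal() to confirm that a rewrite returns the same numbers, and system.time() to compare run times -- can be tried on a stripped-down version of the recurrence. f.slow() and f.fast() below are illustrative names, not code from the thread:

set.seed(1)
a <- runif(2000)                       # stand-in for one subgroup's column
d <- log(0.5) / 2
l <- -log(1 - 0.3)

## original style: invariants recomputed on every iteration
f.slow <- function(a) {
  x <- numeric(length(a))
  x[1] <- 1 - (1 - 0 * exp(1)^d) / (exp(1)^(a[1] * l * 10))
  for (i in 2:length(a)) {
    x[i] <- 1 - (1 - x[i - 1] * exp(1)^d) / (exp(1)^(a[i] * l * 10))
  }
  x
}

## revised style: exp(1)^d and the denominator computed once, outside the loop
f.fast <- function(a) {
  ed    <- exp(d)
  denom <- exp(a * l * 10)
  x <- numeric(length(a))
  x[1] <- 1 - 1 / denom[1]
  for (i in 2:length(a)) {
    x[i] <- 1 - (1 - x[i - 1] * ed) / denom[i]
  }
  x
}

all.equal(f.slow(a), f.fast(a))          # should be TRUE
system.time(for (k in 1:200) f.slow(a))  # compare elapsed times
system.time(for (k in 1:200) f.fast(a))

Because this toy version already keeps the data frame out of the loop, the timing gap here reflects only the hoisted exp() calls; in the original code the repeated data-frame subsetting was likely the larger cost.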
>> }
>> }
>> }
>> }
>>
>> ### END OF THE CODE
>>
>> On Fri, Mar 26, 2010 at 5:25 PM, Bert Gunter <gunter.ber...@gene.com> wrote:
>>> Dimitri:
>>>
>>> If you follow the R posting guide you're more likely to get useful replies.
>>> In particular, it asks for **small** reproducible examples -- your example is
>>> far more code than I care to spend time on anyway (others may be more
>>> willing or more able to do so, of course). I suggest you try (if you haven't
>>> already):
>>>
>>> 1. Profiling the code using Rprof to isolate where the time is spent. And then...
>>>
>>> 2. Writing a **small** reproducible example to exercise that portion of the
>>> code and post it with your question to the list, if you need to.
>>>
>>> Typically, if you do these things you'll figure out how to fix the
>>> situation on your own.
>>>
>>> Cheers,
>>>
>>> Bert Gunter
>>> Genentech Nonclinical Statistics
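For readers who have not used the profiler Bert mentions, a minimal Rprof session looks roughly like this; the output file name and the deliberately slow placeholder workload are arbitrary:

Rprof("profile.out")                  # start sampling the call stack to a file
x <- numeric(0)                       # placeholder workload: a deliberately slow
for (i in 1:30000) x <- c(x, i^2)     # loop that grows a vector one element at a time
Rprof(NULL)                           # stop profiling
summaryRprof("profile.out")$by.self   # functions ranked by time spent in them

Run on the code below, a profile of this kind would most likely show the self time concentrated in data-frame subsetting inside the innermost loop.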
>>>
>>> -----Original Message-----
>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
>>> Behalf Of Dimitri Liakhovitski
>>> Sent: Friday, March 26, 2010 2:06 PM
>>> To: r-help
>>> Subject: [R] Competing with SPSS and SAS: improving code that loops
>>> through rows (data manipulation)
>>>
>>> Dear R-ers,
>>>
>>> In my question there are no statistics involved - it's all about data
>>> manipulation in R.
>>> I am trying to write code that should replace what's currently being
>>> done in SAS and SPSS. Or, at least, I am trying to show my
>>> colleagues that R is not much worse than SAS/SPSS for the task at hand.
>>> I've written code that works, but it's too slow - probably because
>>> it's looping through a lot of things. But I am not seeing how to
>>> improve it. I've already written a different version, but it's 5 times
>>> slower than this one. The code below takes me slightly above 5 sec for
>>> the tiny data set. I've tried using it with a real one - it was not done
>>> after hours.
>>> I need the help of the list! Maybe someone will have an idea on how to
>>> increase the efficiency of my code (just one block of it - in the
>>> "DATA TRANSFORMATION" section below)?
>>>
>>> Below I am creating a data set whose structure is similar to the
>>> data sets the code should be applied to. Also, I have described what's
>>> actually being done - in the comments.
>>> Thanks a lot to anyone for any suggestion!
>>>
>>> Dimitri
>>>
>>> ###### CREATING THE TEST DATA SET ################################
>>>
>>> set.seed(123)
>>> data <- data.frame(group = c(rep("first", 10), rep("second", 10)),
>>>                    week = c(1:10, 1:10),
>>>                    a = abs(round(rnorm(20)*10, 0)),
>>>                    b = abs(round(rnorm(20)*100, 0)))
>>> data
>>> dim(data)[1]   # !!! In real life I might have up to 150 (!) rows
>>>                # (weeks) within each subgroup
>>>
>>> ### Specifying parameters used in the code below:
>>> vars <- names(data)[3:4]      # names of variables to be transformed
>>> nr.vars <- length(vars)       # number of variables to be transformed;
>>>                               # !!! in real life I'll have to deal with up to 50-60 variables, not 2.
>>> group.var <- names(data)[1]   # name of the grouping variable
>>> subgroups <- levels(data[[group.var]])   # names of subgroups;
>>>                                          # !!! in real life I'll have up to 20-25 subgroups, not 2.
>>>
>>> # For EACH subgroup: indexing variables a and b to their maximum in
>>> # that subgroup.
>>> # Further, I'll have to use these indexed variables to build the new ones:
>>> for(i in vars){
>>>   new.name <- paste(i, ".ind.to.max", sep="")
>>>   data[[new.name]] <- NA
>>> }
>>>
>>> indexed.vars <- names(data)[grep("ind.to.max$", names(data))]   # variables indexed to subgroup max
>>> for(subgroup in subgroups){
>>>   data[data[[group.var]] %in% subgroup, indexed.vars] <-
>>>     lapply(data[data[[group.var]] %in% subgroup, vars], function(x){
>>>       y <- x/max(x)
>>>       return(y)
>>>     })
>>> }
>>> data
>>>
>>> ############# DATA TRANSFORMATION #########################################
>>>
>>> # Objective: Create new variables based on the old ones (a and b ind.to.max).
>>> # For each new variable, the value in a given row is a function of
>>> # (a) 2 constants (that have several levels each),
>>> # (b) the corresponding value of the original variable (e.g.,
>>> #     "a.ind.to.max"), and the value in the previous row of the same new
>>> #     variable.
>>> # PLUS: it has to be done by subgroup (variable "group").
>>>
>>> constant1 <- c(1:3)              # constant 1 used for transformation -
>>>                                  # has 3 levels; !!! in real life it will have up to 7 levels
>>> constant2 <- seq(.15, .45, .15)  # constant 2 used for transformation -
>>>                                  # has 3 levels; !!! in real life it will have up to 7 levels
>>>
>>> # CODE THAT IS TOO SLOW (it uses parameters specified in the previous
>>> # code section):
>>> start1 <- Sys.time()
>>> for(var in indexed.vars){             # looping through variables
>>>   for(c1 in 1:length(constant1)){     # looping through levels of constant1
>>>     for(c2 in 1:length(constant2)){   # looping through levels of constant2
>>>       d = log(0.5)/constant1[c1]
>>>       l = -log(1-constant2[c2])
>>>       name <- paste(strsplit(var, ".ind.to.max"), constant1[c1],
>>>                     constant2[c2]*100, "..transf", sep=".")
>>>       data[[name]] <- NA
>>>       for(subgroup in subgroups){     # looping through subgroups
>>>         data[data[[group.var]] %in% subgroup, name][1] =
>>>           1-((1-0*exp(1)^d)/
>>>              (exp(1)^(data[data[[group.var]] %in% subgroup, var][1]*l*10)))
>>>           # this is just the very first row of each subgroup
>>>         for(case in 2:nrow(data[data[[group.var]] %in% subgroup, ])){
>>>           # looping through the remaining rows of the subgroup
>>>           data[data[[group.var]] %in% subgroup, name][case] =
>>>             1-((1-data[data[[group.var]] %in% subgroup, name][case-1]*exp(1)^d)/
>>>                (exp(1)^(data[data[[group.var]] %in% subgroup, var][case]*l*10)))
>>>         }
>>>       }
>>>     }
>>>   }
>>> }
>>> end1 <- Sys.time()
>>> print(end1 - start1)   # Takes me ~0.53 secs
>>> names(data)
>>> data
>>>
>>> --
>>> Dimitri Liakhovitski
>>> Ninah.com
>>> dimitri.liakhovit...@ninah.com
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793

--
Dimitri Liakhovitski
Ninah.com
dimitri.liakhovit...@ninah.com

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.