Re: [R] Improving data processing efficiency

Daniel Folkinshteyn Thu, 05 Jun 2008 13:52:18 -0700

Thanks, I'll take a look at Rprof... but I think what I'm missing is facility with R idiom to get around the looping, and no amount of profiling will help me with that :)

Also, full working code is provided in my original post (see toward the bottom).


on 06/05/2008 03:43 PM bartjoosen said the following:

Maybe you should provide minimal, working code with data, so that we all
can give it a try.
In the meantime, take a look at the Rprof function to see where your code
can be improved.
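
For anyone who has not used Rprof before, a minimal sketch of a profiling run might look like the following. The function slow_rbind() is a made-up stand-in for the real matching routine, and the sampling interval is just a guess at something reasonable:

```r
# Hypothetical toy function standing in for the slow routine:
# it grows a matrix one row at a time with rbind().
slow_rbind <- function(n) {
  out <- matrix(nrow = 0, ncol = 2)
  for (i in seq_len(n)) out <- rbind(out, c(i, i^2))
  out
}

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.01)  # start sampling the call stack
res <- slow_rbind(20000)
Rprof(NULL)                        # stop profiling

# Time spent in each function itself, most expensive first
prof <- summaryRprof(prof_file)
head(prof$by.self)
```

In a run like this one would expect rbind to dominate the by.self table, which points at growing the result row by row as the thing to fix.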

Good luck

Bart


Daniel Folkinshteyn-2 wrote:
Hi everyone!

I have a question about data processing efficiency.
My data are as follows: I have a data set on quarterly institutional ownership of equities; some of them have had recent IPOs, some have not (I have a binary flag set). The total dataset size is 700k+ rows.
My goal is this: For every quarter since issue for each IPO, I need to find a "matched" firm in the same industry, and close in market cap. So, e.g., for firm X, which had an IPO, I need to find a matched non-issuing firm in quarter 1 since IPO, then a (possibly different) non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing firm (there are about 8300 of these).
Thus it seems to me that I need to be doing a lot of data selection and subsetting, and looping (yikes!), but the result appears to be highly inefficient and takes ages (well, many hours). What I am doing, in pseudocode, is this:
1. for each quarter of data, getting out all the IPOs and all the eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same industry, sort them by size, and finally grab a matching firm closest in size (the exact procedure is to grab the closest bigger firm if one exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same "quarters since issue" as the IPO being matched
4. rbind them all into the "matching" dataset.
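
The matching rule in step 2 can be written without sorting at all. This is only a sketch for a single IPO firm: pick_peer() and its arguments are hypothetical names, and it assumes the market caps contain no missing values:

```r
# Hypothetical helper: choose the matching peer for one IPO firm.
# peer_caps: market caps of eligible same-industry non-issuers
# ipo_cap:   market cap of the IPO firm
# Returns an index into peer_caps, or NA if there are no peers.
pick_peer <- function(peer_caps, ipo_cap) {
  if (length(peer_caps) == 0) return(NA_integer_)
  bigger <- which(peer_caps >= ipo_cap)
  if (length(bigger) > 0) {
    # closest bigger firm: the smallest cap that is >= the IPO cap
    bigger[which.min(peer_caps[bigger])]
  } else {
    # all peers are smaller: take the biggest available
    which.max(peer_caps)
  }
}

caps <- c(120, 45, 300, 80)
pick_peer(caps, 100)  # 1, the peer with cap 120
pick_peer(caps, 500)  # 3, the peer with cap 300
```

When several peers tie for the largest cap, which.max() returns the first of them, whereas taking the last row of a sorted data frame returns the last, so results could differ on such ties.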
The function I currently have is pasted below, for your reference. Is there any way to make it produce the same result but much faster? Specifically, I am guessing eliminating some loops would be very good, but I don't see how, since I need to do some fancy footwork for each IPO in each quarter to find the matching firm. I'll be doing a few things similar to this, so it's somewhat important to up the efficiency of this. Maybe some of you R-fu masters can clue me in? :)
I would appreciate any help, tips, tricks, tweaks, you name it! :)

========== my function below ===========
fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, quarters_since_issue=40) {
     result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is cheaper, so typecast the result to matrix
     colnames = names(tfdata)

     quarterends = sort(unique(tfdata$DATE))

     for (aquarter in quarterends) {
         tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]
         tfdata_quarter_fitting_nonissuers = tfdata_quarter[(tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & (tfdata_quarter$IPO.Flag == 0), ]
         tfdata_quarter_ipoissuers = tfdata_quarter[tfdata_quarter$IPO.Flag == 1, ]
         for (i in seq_len(nrow(tfdata_quarter_ipoissuers))) { # seq_len() skips quarters with no IPOs; 1:0 would iterate over 1 and 0
             arow = tfdata_quarter_ipoissuers[i,]
             industrypeers = tfdata_quarter_fitting_nonissuers[tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
             industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]
             if ( nrow(industrypeers) > 0 ) {
                 if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ]) > 0 ) {
                     bestpeer = industrypeers[industrypeers$Market.Cap.13f >= arow$Market.Cap.13f, ][1,]
                 }
                 else {
                     bestpeer = industrypeers[nrow(industrypeers),]
                 }
                 bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                 #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == bestpeer$PERMNO] = 1
                 result = rbind(result, as.matrix(bestpeer))
             }
         }
         #result = rbind(result, tfdata_quarter)
         print (aquarter)
     }

     result = as.data.frame(result)
     names(result) = colnames
     return(result)

}

========= end of my function =============
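
One possible way to drop the inner loop, offered as a sketch and not as a tested replacement: sort the eligible non-issuers by size once, split both issuers and non-issuers by quarter and industry, and look up every IPO in a group in one call with findInterval(). The name match_nonissuers() is made up; the column names are the ones used in the function above. It assumes no missing market caps and an R version whose findInterval() has the left.open argument (R 3.3.0 or later). Results are collected in a list and bound once at the end, instead of calling rbind() for every match.

```r
match_nonissuers <- function(tfdata, quarters_since_issue = 40) {
  ipo_rows  <- tfdata[tfdata$IPO.Flag == 1, ]
  peer_rows <- tfdata[tfdata$IPO.Flag == 0 &
                      tfdata$Quarters.Since.Latest.Issue > quarters_since_issue, ]
  # Sort peers by size once; split() keeps this order within each group.
  peer_rows <- peer_rows[order(peer_rows$Market.Cap.13f), ]

  group_key <- function(d) paste(d$DATE, d$HSICIG, sep = "|")
  peers_by_group <- split(peer_rows, group_key(peer_rows))
  ipos_by_group  <- split(ipo_rows,  group_key(ipo_rows))

  matched <- lapply(names(ipos_by_group), function(g) {
    peers <- peers_by_group[[g]]
    if (is.null(peers)) return(NULL)  # no eligible peers in this group
    ipos <- ipos_by_group[[g]]
    # Count of peers strictly smaller than each IPO, plus one, is the
    # position of the first peer at least as large as the IPO.
    idx <- findInterval(ipos$Market.Cap.13f, peers$Market.Cap.13f,
                        left.open = TRUE) + 1L
    # If every peer is smaller, fall back to the largest peer.
    idx <- pmin(idx, nrow(peers))
    best <- peers[idx, ]
    best$Quarters.Since.IPO.Issue <- ipos$Quarters.Since.IPO.Issue
    best
  })
  do.call(rbind, Filter(Negate(is.null), matched))
}

# Toy data, invented purely to exercise the sketch
tf <- data.frame(
  DATE = c(1, 1, 1, 1, 1, 2, 2),
  PERMNO = 1:7,
  HSICIG = c("A", "A", "A", "A", "B", "A", "A"),
  IPO.Flag = c(1, 0, 0, 0, 1, 1, 0),
  Market.Cap.13f = c(100, 50, 120, 300, 10, 500, 200),
  Quarters.Since.Latest.Issue = c(1, 50, 50, 50, 1, 2, 60),
  Quarters.Since.IPO.Issue = c(1, NA, NA, NA, 1, 2, NA)
)
m <- match_nonissuers(tf)
m[, c("DATE", "PERMNO", "Market.Cap.13f", "Quarters.Since.IPO.Issue")]
```

In the toy data, the first IPO (cap 100) should be matched to PERMNO 3 (cap 120, the closest bigger peer), the industry B IPO has no peers and is skipped, and the second-quarter IPO (cap 500) falls back to PERMNO 7, the biggest available. Rows come back grouped by quarter and industry, not in the order the loop would produce them.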

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

