Thanks for the suggestions! I'll play with this over the weekend and see
what comes out. :)
On 06/06/2008 06:48 PM, Don MacQueen said the following:
In a case like this, if you can possibly work with matrices instead of
data frames, you might get significant speedup.
(More accurately, I have had situations where I obtained a speedup by
working with matrices instead of data frames.)
Even if you have to code character columns as numeric, it can be worth it.
Data frames have overhead that matrices do not. (Here's where profiling
might have given a clue.) Granted, there has been recent work in reducing
the overhead associated with data frames, but I think it's worth a try.
Carrying along extra columns and doing row subsetting, rbinding, etc,
means a lot more things happening in memory.
So, for example, if all of your matching is based just on a few columns,
extract those columns, convert them to a matrix, do all the matching,
and then based on some sort of row index retrieve all of the associated
columns.
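A minimal sketch of that idea, using made-up column names (`id`, `size`,
`label`) rather than the actual variables in the original data:

```r
# Toy data frame standing in for the real data: a numeric matching
# column, a character column, and a row id (all names here are made up)
df <- data.frame(id = 1:5,
                 size = c(10, 3, 7, 1, 9),
                 label = c("a", "b", "c", "d", "e"),
                 stringsAsFactors = FALSE)

# Extract only the columns needed for matching into a numeric matrix
m <- as.matrix(df[, c("id", "size")])

# Do the matching on the matrix -- here, rows with size >= 7
hit_ids <- m[m[, "size"] >= 7, "id"]

# Finally, use the ids to pull the full rows (character columns and all)
matched <- df[df$id %in% hit_ids, ]
```

The point is that all the repeated subsetting happens on the cheap
numeric matrix, and the data frame is touched only once at the end.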
-Don
At 2:09 PM -0400 6/5/08, Daniel Folkinshteyn wrote:
Hi everyone!
I have a question about data processing efficiency.
My data are as follows: I have a data set on quarterly institutional
ownership of equities; some of them have had recent IPOs, some have
not (I have a binary flag set). The total dataset size is 700k+ rows.
My goal is this: For every quarter since issue for each IPO, I need to
find a "matched" firm in the same industry, and close in market cap.
So, e.g., for firm X, which had an IPO, I need to find a matched
non-issuing firm in quarter 1 since IPO, then a (possibly different)
non-issuing firm in quarter 2 since IPO, etc. Repeat for each issuing
firm (there are about 8300 of these).
Thus it seems to me that I need to be doing a lot of data selection
and subsetting, and looping (yikes!), but the resulting code is
highly inefficient and takes ages (well, many hours). What I am doing,
in pseudocode, is this:
1. for each quarter of data, getting out all the IPOs and all the
eligible non-issuing firms.
2. for each IPO in a quarter, grab all the non-issuers in the same
industry, sort them by size, and finally grab a matching firm closest
in size (the exact procedure is to grab the closest bigger firm if one
exists, and just the biggest available if all are smaller)
3. assign the matched firm-observation the same "quarters since issue"
as the IPO being matched
4. rbind them all into the "matching" dataset.
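The selection rule in step 2 (closest bigger peer if one exists, else the
biggest available) doesn't actually require sorting the peers on every
iteration; a sketch with hypothetical market-cap vectors:

```r
# Hypothetical market caps of non-issuing peers in one industry,
# and the market cap of the IPO firm being matched
peer_caps <- c(5, 12, 8, 20, 3)
ipo_cap <- 9

bigger <- which(peer_caps >= ipo_cap)
best <- if (length(bigger) > 0) {
    bigger[which.min(peer_caps[bigger])]  # smallest peer that is still bigger
} else {
    which.max(peer_caps)                  # all peers smaller: take the biggest
}
peer_caps[best]  # 12
```

Two O(n) scans per IPO instead of an O(n log n) sort, and no
intermediate sorted copy of the peer subset.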
The function I currently have is pasted below, for your reference. Is
there any way to make it produce the same result but much faster?
Specifically, I am guessing that eliminating some loops would help a
lot, but I don't see how, since I need to do some fancy footwork for
each IPO in each quarter to find the matching firm. I'll be doing a few
things similar to this, so it's somewhat important to improve its
efficiency. Maybe some of you R-fu masters can clue me in? :)
I would appreciate any help, tips, tricks, tweaks, you name it! :)
========== my function below ===========
fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata,
        quarters_since_issue = 40) {

    # rbind for matrix is cheaper, so typecast the result to matrix
    result = matrix(nrow = 0, ncol = ncol(tfdata))
    colnames = names(tfdata)
    quarterends = sort(unique(tfdata$DATE))

    for (aquarter in quarterends) {
        tfdata_quarter = tfdata[tfdata$DATE == aquarter, ]

        tfdata_quarter_fitting_nonissuers = tfdata_quarter[
            (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) &
            (tfdata_quarter$IPO.Flag == 0), ]
        tfdata_quarter_ipoissuers = tfdata_quarter[
            tfdata_quarter$IPO.Flag == 1, ]

        for (i in 1:nrow(tfdata_quarter_ipoissuers)) {
            arow = tfdata_quarter_ipoissuers[i, ]
            industrypeers = tfdata_quarter_fitting_nonissuers[
                tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ]
            industrypeers = industrypeers[order(industrypeers$Market.Cap.13f), ]

            if (nrow(industrypeers) > 0) {
                if (nrow(industrypeers[industrypeers$Market.Cap.13f >=
                        arow$Market.Cap.13f, ]) > 0) {
                    # closest bigger peer: first row at or above the IPO's cap
                    bestpeer = industrypeers[industrypeers$Market.Cap.13f >=
                        arow$Market.Cap.13f, ][1, ]
                } else {
                    # all peers are smaller: take the biggest available
                    bestpeer = industrypeers[nrow(industrypeers), ]
                }
                bestpeer$Quarters.Since.IPO.Issue = arow$Quarters.Since.IPO.Issue
                # tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO ==
                #     bestpeer$PERMNO] = 1
                result = rbind(result, as.matrix(bestpeer))
            }
        }
        # result = rbind(result, tfdata_quarter)
        print(aquarter)
    }

    result = as.data.frame(result)
    names(result) = colnames
    return(result)
}
========= end of my function =============
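One likely speedup, independent of the matrix-vs-data-frame question:
growing `result` with `rbind` inside the loop copies the whole
accumulated object on every match, which makes the loop quadratic in
the number of matches. Collecting the matched rows in a list and
binding once at the end is linear. A sketch with toy rows:

```r
# Collect each matched row in a list...
pieces <- vector("list", 100)
for (i in 1:100) {
    pieces[[i]] <- data.frame(id = i, val = i * 2)
}

# ...and bind once at the end, instead of rbind-ing inside the loop
result <- do.call(rbind, pieces)
nrow(result)  # 100
```

With ~8300 issuers times up to 40 quarters, avoiding the repeated
copies alone could change the runtime by orders of magnitude.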
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.