That is the last line of every message to r-help. On Fri, Jun 6, 2008 at 12:05 PM, Gabor Grothendieck <[EMAIL PROTECTED]> wrote: > Its summarized in the last line to r-help. Note reproducible and > minimal. > > On Fri, Jun 6, 2008 at 12:03 PM, Daniel Folkinshteyn <[EMAIL PROTECTED]> > wrote: >> i did! what did i miss? >> >> on 06/06/2008 11:45 AM Gabor Grothendieck said the following: >>> >>> Try reading the posting guide before posting. >>> >>> On Fri, Jun 6, 2008 at 11:12 AM, Daniel Folkinshteyn <[EMAIL PROTECTED]> >>> wrote: >>>> >>>> Anybody have any thoughts on this? Please? :) >>>> >>>> on 06/05/2008 02:09 PM Daniel Folkinshteyn said the following: >>>>> >>>>> Hi everyone! >>>>> >>>>> I have a question about data processing efficiency. >>>>> >>>>> My data are as follows: I have a data set on quarterly institutional >>>>> ownership of equities; some of them have had recent IPOs, some have not >>>>> (I >>>>> have a binary flag set). The total dataset size is 700k+ rows. >>>>> >>>>> My goal is this: For every quarter since issue for each IPO, I need to >>>>> find a "matched" firm in the same industry, and close in market cap. So, >>>>> e.g., for firm X, which had an IPO, i need to find a matched non-issuing >>>>> firm in quarter 1 since IPO, then a (possibly different) non-issuing >>>>> firm in >>>>> quarter 2 since IPO, etc. Repeat for each issuing firm (there are about >>>>> 8300 >>>>> of these). >>>>> >>>>> Thus it seems to me that I need to be doing a lot of data selection and >>>>> subsetting, and looping (yikes!), but the result appears to be highly >>>>> inefficient and takes ages (well, many hours). What I am doing, in >>>>> pseudocode, is this: >>>>> >>>>> 1. for each quarter of data, getting out all the IPOs and all the >>>>> eligible >>>>> non-issuing firms. >>>>> 2. for each IPO in a quarter, grab all the non-issuers in the same >>>>> industry, sort them by size, and finally grab a matching firm closest in >>>>> size (the exact procedure is to grab the closest bigger firm if one >>>>> exists, >>>>> and just the biggest available if all are smaller) >>>>> 3. assign the matched firm-observation the same "quarters since issue" >>>>> as >>>>> the IPO being matched >>>>> 4. rbind them all into the "matching" dataset. >>>>> >>>>> The function I currently have is pasted below, for your reference. Is >>>>> there any way to make it produce the same result but much faster? >>>>> Specifically, I am guessing eliminating some loops would be very good, >>>>> but I >>>>> don't see how, since I need to do some fancy footwork for each IPO in >>>>> each >>>>> quarter to find the matching firm. I'll be doing a few things similar to >>>>> this, so it's somewhat important to up the efficiency of this. Maybe >>>>> some of >>>>> you R-fu masters can clue me in? :) >>>>> >>>>> I would appreciate any help, tips, tricks, tweaks, you name it! :) >>>>> >>>>> ========== my function below =========== >>>>> >>>>> fcn_create_nonissuing_match_by_quarterssinceissue = function(tfdata, >>>>> quarters_since_issue=40) { >>>>> >>>>> result = matrix(nrow=0, ncol=ncol(tfdata)) # rbind for matrix is >>>>> cheaper, so typecast the result to matrix >>>>> >>>>> colnames = names(tfdata) >>>>> >>>>> quarterends = sort(unique(tfdata$DATE)) >>>>> >>>>> for (aquarter in quarterends) { >>>>> tfdata_quarter = tfdata[tfdata$DATE == aquarter, ] >>>>> >>>>> tfdata_quarter_fitting_nonissuers = tfdata_quarter[ >>>>> (tfdata_quarter$Quarters.Since.Latest.Issue > quarters_since_issue) & >>>>> (tfdata_quarter$IPO.Flag == 0), ] >>>>> tfdata_quarter_ipoissuers = tfdata_quarter[ >>>>> tfdata_quarter$IPO.Flag >>>>> == 1, ] >>>>> >>>>> for (i in 1:nrow(tfdata_quarter_ipoissuers)) { >>>>> arow = tfdata_quarter_ipoissuers[i,] >>>>> industrypeers = tfdata_quarter_fitting_nonissuers[ >>>>> tfdata_quarter_fitting_nonissuers$HSICIG == arow$HSICIG, ] >>>>> industrypeers = industrypeers[ >>>>> order(industrypeers$Market.Cap.13f), ] >>>>> if ( nrow(industrypeers) > 0 ) { >>>>> if ( nrow(industrypeers[industrypeers$Market.Cap.13f >= >>>>> arow$Market.Cap.13f, ]) > 0 ) { >>>>> bestpeer = industrypeers[industrypeers$Market.Cap.13f >>>>>> >>>>>> = arow$Market.Cap.13f, ][1,] >>>>> >>>>> } >>>>> else { >>>>> bestpeer = industrypeers[nrow(industrypeers),] >>>>> } >>>>> bestpeer$Quarters.Since.IPO.Issue = >>>>> arow$Quarters.Since.IPO.Issue >>>>> >>>>> #tfdata_quarter$Match.Dummy.By.Quarter[tfdata_quarter$PERMNO == >>>>> bestpeer$PERMNO] = 1 >>>>> result = rbind(result, as.matrix(bestpeer)) >>>>> } >>>>> } >>>>> #result = rbind(result, tfdata_quarter) >>>>> print (aquarter) >>>>> } >>>>> >>>>> result = as.data.frame(result) >>>>> names(result) = colnames >>>>> return(result) >>>>> >>>>> } >>>>> >>>>> ========= end of my function ============= >>>>> >>>> ______________________________________________ >>>> R-help@r-project.org mailing list >>>> https://stat.ethz.ch/mailman/listinfo/r-help >>>> PLEASE do read the posting guide >>>> http://www.R-project.org/posting-guide.html >>>> and provide commented, minimal, self-contained, reproducible code. >>>> >>> >> >
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.