On Dec 29, 2010, at 11:03 AM, Ali Salekfard wrote:

> David,
>
> Thanks a lot. Your code worked fine on the whole dataset (no
> memory error as I had with the other ideas). I do like the style,
> especially the fact that it is all in one line, but for large
> datasets it takes longer than what I wrote. I ran it on the same
> machine with the same set of 144,643 rules; your code takes 81.50
> seconds.
>
> > a<-my.mapping[ with(my.mapping, DATE == ave( DATE, ACCOUNT, FUN=max )), ]
>
>   Description            Duration
> 1 Max.Date for Mappings    81.498
>
> I guess the running time of your algorithm is exponential to the
> number of rows.

If the large database has a large number of columns, there might be an
improvement from using just the necessary columns:

a <- my.mapping[ with(my.mapping[ , c("DATE", "ACCOUNT")],
                      DATE == ave( DATE, ACCOUNT, FUN=max )), ]

Or using subset. It occurs to me that this may be applicable to a
problem I have on my to-do list, so if I run into problems on my
dataset, which is about 30 times longer than yours, I will have a
backup plan.

Best;
David.

>
> Ali
>
> On Wed, Dec 29, 2010 at 3:24 PM, David Winsemius <dwinsem...@comcast.net> wrote:
>
> On Dec 29, 2010, at 9:24 AM, Ali Salekfard wrote:
>
> Thanks to everyone. Joshua's response seemed the most concise one, but it
> used up so much memory that my R just gave an error. I checked the other
> replies and all in all I came up with this, and thought to share it with
> others and get comments.
>
> My structure was as follows:
>
> ACCOUNT  RULE  DATE
> A1       xxxx  2010-01-01
> A2       xxxx  2007-05-01
> A2       xxxx  2007-05-01
> A2       xxxx  2005-05-01
> A2       xxxx  2005-05-01
> A1       xxxx  2009-01-01
>
> The most efficient solution I came across involves the following steps:
>
> 1. Find the latest date for each account, and convert it to a data frame:
>
> a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
> a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
>
> 2. Merge the set with the original data.
>
> my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
>
> 3. Create a TAKE column, which confirms whether the date of the row is
> the maximum date for the account.
>
> my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
>
> 4. Filter out all lines except those with TAKE==TRUE.
>
> my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
>
> The running time for my whole list was 4.5 sec, which is far better than
> any other way I tried. Let me have your thoughts on that.
>
> My first thought is that you should use more spaces in your code.
> It looks quite a bit more complex than the method I suggested (and my
> benchmark says mine was maybe 50% faster, but with Maechler's
> improvements is now about 4 times faster. I guess I shouldn't throw
> too many stones about coding style.)
>
> my.mapping[ with(my.mapping, DATE == ave( DATE, ACCOUNT, FUN=max) ), ]
> #------------------
> require(rbenchmark)
> ave.method = function(df, acc, dt)
>   {df[with( df, dt == ave(dt, acc, FUN=max)), ]}
> merge.method = function(df, acc, dt) {
>   a <- tapply(df[[dt]], df[[acc]], max)
>   a <- data.frame(ACCOUNT=names(a), DT=a)
>   df <- merge(x=df, y=a, by.x=acc, by.y="ACCOUNT")
>   df <- cbind(df, TAKE=df[dt]==df$DT)
>   df <- df[df$TAKE==TRUE,]}
> benchmark(
>   rep=ave.method(airquality, "Month", "Day"),
>   pat=merge.method(airquality, "Month", "Day"),
>   replications=1000,
>   order=c('replications', 'elapsed'))
> #-----------------
>   test replications elapsed relative user.self sys.self user.child sys.child
> 1  rep         1000   2.523 1.000000     2.512    0.018          0         0
> 2  pat         1000   7.847 3.110186     7.773    0.092          0         0
>
> It does give the same answers when tested on airquality, though.
> That says something for it, I suppose. (Had you offered a sensible
> test dataset in your first posting, I would have offered a solution
> using your column names, but as it was I figured you should have
> been able to make the mappings.)
>
> --
> David.
>
> Ali
>
> David Winsemius, MD
> West Hartford, CT

David Winsemius, MD
West Hartford, CT
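
For anyone who wants to try the ave() idiom with column names passed as strings, here is a minimal sketch. The helper name latest_per_group and the toy data frame are hypothetical and not from the thread; the data only mimics the ACCOUNT/RULE/DATE layout described above, and it assumes DATE is stored as class Date. One thing to watch: inside with(df, ...), an argument such as dt that holds the string "DATE" is compared as a literal string, not as the column it names, so the sketch looks the columns up with [[ instead.

```r
# Hypothetical helper (not from the thread): keep the rows whose value in
# column `dt` equals the per-group maximum of `dt` within column `acc`.
latest_per_group <- function(df, acc, dt) {
  # [[ ]] resolves the string names against the data frame's columns.
  is_latest <- df[[dt]] == ave(df[[dt]], df[[acc]], FUN = max)
  df[is_latest, , drop = FALSE]
}

# Toy data shaped like the example in the thread (assumed, for illustration).
my.mapping <- data.frame(
  ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
  RULE    = "xxxx",
  DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                      "2005-05-01", "2005-05-01", "2009-01-01")),
  stringsAsFactors = FALSE
)

latest <- latest_per_group(my.mapping, "ACCOUNT", "DATE")
print(latest)
```

On the toy data this keeps the 2010-01-01 row for A1 and both 2007-05-01 rows for A2. Rows that tie on the latest date are all retained, which matches what the one-line ave() filter does.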