David,

Thanks a lot. Your code worked fine on the whole dataset (no memory error, as I had with the other ideas). I do like the style, especially the fact that it is all in one line, but for large datasets it takes longer than what I wrote. I ran it on the same machine with the same set of 144,643 rules, and your code takes 81.50 seconds:

> a<-my.mapping[ with(my.mapping, DATE == ave( DATE, ACCOUNT,FUN=max )), ]

            Description Duration
1 Max.Date for Mappings   81.498

I guess the running time of your algorithm is exponential in the number of rows.

Ali

On Wed, Dec 29, 2010 at 3:24 PM, David Winsemius <dwinsem...@comcast.net> wrote:

>
> On Dec 29, 2010, at 9:24 AM, Ali Salekfard wrote:
>
>> Thanks to everyone. Joshua's response seemed the most concise one, but it
>> used up so much memory that my R just gave an error. I checked the other
>> replies and all in all I came up with this, and thought to share it with
>> others and get comments.
>>
>> My structure was as follows:
>>
>> ACCOUNT    RULE    DATE
>> A1         xxxx    2010-01-01
>> A2         xxxx    2007-05-01
>> A2         xxxx    2007-05-01
>> A2         xxxx    2005-05-01
>> A2         xxxx    2005-05-01
>> A1         xxxx    2009-01-01
>>
>> The most efficient solution I came across involves the following steps:
>>
>> 1. Find the latest date for each account, and convert it to a data frame:
>>
>>    a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
>>    a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
>>
>> 2. Merge the set with the original data:
>>
>>    my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
>>
>> 3. Create a TAKE column, which confirms whether the date of the row is the
>>    maximum date for the account:
>>
>>    my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
>>
>> 4. Filter out all lines except those with TAKE==TRUE:
>>
>>    my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
>>
>> The running time for my whole list was 4.5 sec, which is far better than
>> any other way I tried. Let me have your thoughts on that.
>>
>
> My first thought is that you should use more spaces in your code. It looks
> quite a bit more complex than the method I suggested (and my benchmark says
> mine was maybe 50% faster, but with Maechler's improvements is now about 4
> times faster. I guess I shouldn't throw too many stones about coding style.)
>
> my.mapping[ with(my.mapping,  DATE == ave( DATE,
>                                            ACCOUNT,
>                                            FUN=max ) ), ]
> #------------------
> require(rbenchmark)
> ave.method = function(df, acc, dt)
>     {df[with( df, dt == ave(dt, acc, FUN=max)), ]}
> merge.method = function(df, acc, dt) {
>     a <- tapply(df[[dt]], df[[acc]], max)
>     a <- data.frame(ACCOUNT=names(a), DT=a)
>     df <- merge(x=df, y=a, by.x=acc, by.y="ACCOUNT")
>     df <- cbind(df, TAKE=df[dt]==df$DT)
>     df <- df[df$TAKE==TRUE,]}
> benchmark(
>     rep=ave.method(airquality, "Month", "Day"),
>     pat=merge.method(airquality, "Month", "Day"),
>     replications=1000,
>     order=c('replications', 'elapsed'))
> #-----------------
>   test replications elapsed relative user.self sys.self user.child sys.child
> 1  rep         1000   2.523 1.000000     2.512    0.018          0         0
> 2  pat         1000   7.847 3.110186     7.773    0.092          0         0
>
>
> It does give the same answers when tested on airquality, though. That says
> something for it, I suppose. (Had you offered a sensible test dataset in your
> first posting, I would have offered a solution using your column names, but
> as it was I figured you should have been able to make the mappings.)
>
>
> --
> David.
>
>
>> Ali
>>
>
>
> David Winsemius, MD
> West Hartford, CT
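
For anyone who wants to check the two approaches side by side with the column names from the original post, here is a minimal sketch on made-up data. The RULE values, the helper names ave_method, merge_method and normalise, and the numeric round-trip for the dates are my own assumptions for illustration; none of them come from the thread. It only checks that both approaches keep the same rows and says nothing about how either one scales.

```r
# Made-up data in the shape described in the original post.
# The RULE values are placeholders (they were "xxxx" in the thread).
my.mapping <- data.frame(
  ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
  RULE    = c("r1", "r2", "r3", "r4", "r5", "r6"),
  DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                      "2005-05-01", "2005-05-01", "2009-01-01")),
  stringsAsFactors = FALSE
)

# One-line approach: keep rows whose DATE equals the per-account maximum.
ave_method <- function(df) {
  df[df$DATE == ave(df$DATE, df$ACCOUNT, FUN = max), ]
}

# tapply + merge approach, following the four steps in the original post.
# Dates go through as.numeric() so the result does not depend on how
# tapply() treats the Date class in a given R version.
merge_method <- function(df) {
  latest <- tapply(as.numeric(df$DATE), df$ACCOUNT, max)
  latest <- data.frame(ACCOUNT = names(latest),
                       DT = as.Date(as.vector(latest), origin = "1970-01-01"),
                       stringsAsFactors = FALSE)
  merged <- merge(df, latest, by = "ACCOUNT")
  merged[merged$DATE == merged$DT, c("ACCOUNT", "RULE", "DATE")]
}

# merge() reorders rows, so sort both results the same way before comparing.
normalise <- function(df) {
  df <- df[order(df$ACCOUNT, df$RULE), ]
  rownames(df) <- NULL
  df
}

res_ave   <- normalise(ave_method(my.mapping))
res_merge <- normalise(merge_method(my.mapping))

print(res_ave)
print(all.equal(res_ave, res_merge))
```

To compare running times on a real dataset, wrapping each call in system.time() on the full data frame should be enough for a rough picture.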