On Dec 29, 2010, at 11:03 AM, Ali Salekfard wrote:
> David,
>
> Thanks a lot. Your code worked fine on the whole dataset (no
> memory error as I had with the other ideas). I do like the style,
> especially the fact that it is all in one line, but for large
> datasets it takes longer than what I wrote. I ran it on the same
> machine with the same set of 144,643 rules, and your code takes 81.50
> seconds.
>
> > a <- my.mapping[ with(my.mapping, DATE == ave(DATE, ACCOUNT, FUN=max)), ]
>
> Description Duration
> 1 Max.Date for Mappings 81.498
>
> I guess the running time of your algorithm is exponential in the
> number of rows.
If the large dataset has a large number of columns, there might be
some improvement from using just the necessary columns:

a <- my.mapping[ with(my.mapping[ , c("DATE", "ACCOUNT")],
                      DATE == ave(DATE, ACCOUNT, FUN=max)), ]

Or use subset().
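
As a minimal sketch of the two variants, here is the same filter run on
the six-row example from the earlier message. The RULE placeholders and
the as.numeric() conversion are assumptions added for this illustration;
the conversion only sidesteps any question of how ave() handles the
Date class.

```r
# Toy data modelled on the structure quoted in this thread
# (the RULE values are placeholders, not real rules).
my.mapping <- data.frame(
  ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
  RULE    = "xxxx",
  DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                      "2005-05-01", "2005-05-01", "2009-01-01")),
  stringsAsFactors = FALSE
)

# Variant 1: build the logical index from the two needed columns only.
day.num <- as.numeric(my.mapping$DATE)   # days since 1970-01-01
keep    <- day.num == ave(day.num, my.mapping$ACCOUNT, FUN = max)
a       <- my.mapping[keep, ]

# Variant 2: the same filter written with subset().
b <- subset(my.mapping,
            as.numeric(DATE) == ave(as.numeric(DATE), ACCOUNT, FUN = max))

print(a)
```

Both variants keep every row that ties for the latest date within its
account, so an account with duplicated latest rows keeps all of them.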
It occurs to me that this may be applicable to a problem I have on my
to-do list, so if I run into problems on my dataset, which is about 30
times longer than yours, I will have a backup plan.
Best;
David.
>
> Ali
>
> On Wed, Dec 29, 2010 at 3:24 PM, David Winsemius <[email protected]> wrote:
>
> On Dec 29, 2010, at 9:24 AM, Ali Salekfard wrote:
>
> Thanks to everyone. Joshua's response seemed the most concise one,
> but it used up so much memory that my R just gave an error. I checked
> the other replies and all in all I came up with this, and thought to
> share it with others and get comments.
>
> My structure was as follows:
>
> ACCOUNT  RULE  DATE
> A1       xxxx  2010-01-01
> A2       xxxx  2007-05-01
> A2       xxxx  2007-05-01
> A2       xxxx  2005-05-01
> A2       xxxx  2005-05-01
> A1       xxxx  2009-01-01
>
> The most efficient solution I came across involves the following
> steps:
>
> 1. Find the latest date for each account, and convert it to a data
> frame:
>
> a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
> a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
>
> 2. Merge the set with the original data:
>
> my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
>
> 3. Create a TAKE column, which confirms whether the date of the row
> is the maximum date for the account:
>
> my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
>
> 4. Filter out all lines except those with TAKE==TRUE:
>
> my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
>
> The running time for my whole list was 4.5 sec, which is far better
> than any other way I tried. Let me have your thoughts on that.
>
> My first thought is that you should use more spaces in your code. It
> looks quite a bit more complex than the method I suggested. My
> benchmark says mine was maybe 50% faster and, with Maechler's
> improvements, is now about 4 times faster. (I guess I shouldn't
> throw too many stones about coding style.)
>
> my.mapping[ with(my.mapping, DATE == ave( DATE,
>                                           ACCOUNT,
>                                           FUN=max) ), ]
> #------------------
> require(rbenchmark)
> # acc and dt are column names passed as strings, so the columns are
> # looked up with [[ ]] rather than by bare name inside with().
> ave.method = function(df, acc, dt) {
>   df[ df[[dt]] == ave(df[[dt]], df[[acc]], FUN=max), ]
> }
> merge.method = function(df, acc, dt) {
>   a <- tapply(df[[dt]], df[[acc]], max)
>   a <- data.frame(ACCOUNT=names(a), DT=a)
>   df <- merge(x=df, y=a, by.x=acc, by.y="ACCOUNT")
>   df <- cbind(df, TAKE = df[[dt]] == df$DT)
>   df[df$TAKE==TRUE, ]
> }
> benchmark(
>   rep=ave.method(airquality, "Month", "Day"),
>   pat=merge.method(airquality, "Month", "Day"),
>   replications=1000,
>   order=c('replications', 'elapsed'))
> #-----------------
>   test replications elapsed relative user.self sys.self user.child sys.child
> 1  rep         1000   2.523 1.000000     2.512    0.018          0         0
> 2  pat         1000   7.847 3.110186     7.773    0.092          0         0
>
>
> It does give the same answers when tested on airquality, though.
> That says something for it I suppose. (Had you offered a sensible
> test dataset in your first posting, I would have offered a solution
> using your column names, but as it was I figured you should have
> been able to make the mappings.)
>
>
> --
> David.
>
>
>
> Ali
>
>
> David Winsemius, MD
> West Hartford, CT
>
>
David Winsemius, MD
West Hartford, CT