Re: [R] Removing rows with earlier dates

Ali Salekfard Wed, 29 Dec 2010 08:49:52 -0800

David,

Thanks a lot. Your code worked fine on the whole dataset (no memory error
as I had with the other ideas). I do like the style, especially the fact
that it is all in one line, but for large datasets it takes longer than
what I wrote. I ran it on the same machine with the same set of 144,643
rules, and your code takes 81.50 seconds.


> a<-my.mapping[ with(my.mapping, DATE == ave( DATE, ACCOUNT,FUN=max )), ]

         Description Duration
1 Max.Date for Mappings   81.498
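
A minimal sketch for trying the one-liner on a toy data frame shaped like the
structure quoted further down. The RULE values and the name toy.mapping are
made up for illustration and are not taken from the real data.

```r
# Toy data mirroring the ACCOUNT / RULE / DATE layout from the thread.
# RULE values are placeholders (assumption); only the dates matter here.
toy.mapping <- data.frame(
  ACCOUNT = c("A1", "A2", "A2", "A2", "A2", "A1"),
  RULE    = paste0("r", 1:6),
  DATE    = as.Date(c("2010-01-01", "2007-05-01", "2007-05-01",
                      "2005-05-01", "2005-05-01", "2009-01-01")),
  stringsAsFactors = FALSE
)

# ave() returns, for every row, the maximum DATE within that row's ACCOUNT;
# comparing it with DATE keeps only the rows carrying the latest date.
latest <- toy.mapping[with(toy.mapping,
                           DATE == ave(DATE, ACCOUNT, FUN = max)), ]
print(latest)
```

Rows that tie on the latest date for an account are all kept, which matches
the behaviour of the merge-based steps.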

I guess the running time of your algorithm is exponential in the number of
rows.
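
One hedged way to check a guess like this, rather than rely on a single
timing, is to time the one-liner on synthetic data of doubling size. If the
elapsed time roughly doubles with each step, growth is about linear; if it
roughly quadruples, it is closer to quadratic. The generator make.mapping,
the sizes, and the ratio of accounts to rows are all assumptions, and the
number of distinct accounts may matter as much as the number of rows.

```r
# Synthetic data (assumption): n rows spread over roughly n/10 accounts.
make.mapping <- function(n) {
  data.frame(
    ACCOUNT = sample(paste0("A", seq_len(max(1, n %/% 10))), n, replace = TRUE),
    DATE    = as.Date("2005-01-01") + sample(0:2000, n, replace = TRUE),
    stringsAsFactors = FALSE
  )
}

# Elapsed seconds for one run of the ave() filter on n synthetic rows.
time.ave <- function(n) {
  d <- make.mapping(n)
  system.time(
    d[with(d, DATE == ave(DATE, ACCOUNT, FUN = max)), ]
  )[["elapsed"]]
}

set.seed(1)
sizes   <- c(1000, 2000, 4000, 8000)
elapsed <- sapply(sizes, time.ave)
print(data.frame(rows = sizes, seconds = elapsed))
```

Timings depend on the machine and R version, so only the ratios between
successive sizes are worth reading.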

Ali

On Wed, Dec 29, 2010 at 3:24 PM, David Winsemius <dwinsem...@comcast.net> wrote:

>
> On Dec 29, 2010, at 9:24 AM, Ali Salekfard wrote:
>
>> Thanks to everyone. Joshua's response seemed the most concise one, but it
>> used up so much memory that my R just gave an error. I checked the other
>> replies and all in all I came up with this, and thought to share it with
>> others and get comments.
>>
>> My structure was as follows:
>>
>> ACCOUNT   RULE  DATE
>> A1             xxxx     2010-01-01
>> A2             xxxx     2007-05-01
>> A2             xxxx     2007-05-01
>> A2             xxxx     2005-05-01
>> A2             xxxx     2005-05-01
>> A1             xxxx     2009-01-01
>>
>> The most efficient solution I came across involves the following steps:
>>
>> 1. Find the latest date for each account, and convert it to a data frame:
>>
>> a<-tapply(my.mapping$DATE,my.mapping$ACCOUNT,max)
>> a<-data.frame(ACCOUNT=names(a),DT=as.Date(a,"%Y-%m-%d"))
>>
>> 2. Merge the set with the original data:
>>
>> my.mapping<-merge(x=my.mapping,y=a,by.x="ACCOUNT",by.y="ACCOUNT")
>>
>> 3. Create a TAKE column, which confirms whether the date of the row is the
>> maximum date for the account:
>>
>> my.mapping<-cbind(my.mapping,TAKE=my.mapping$DATE==my.mapping$DT)
>>
>> 4. Filter out all lines except those with TAKE==TRUE:
>>
>> my.mapping<-my.mapping[my.mapping$TAKE==TRUE,]
>>
>> The running time for my whole list was 4.5 sec, which is far better than
>> any other way I tried. Let me have your thoughts on that.
>>
>
> My first thought is that you should use more spaces in your code. It looks
> quite a bit more complex than the method I suggested (and my benchmark says
> mine was maybe 50% faster, but with Maechler's improvements is now about 4
> times faster. I guess I shouldn't throw too many stones about coding style.)
>
> my.mapping[ with(my.mapping, DATE == ave( DATE,
>                                          ACCOUNT,
>                                          FUN=max ) ), ]
> #------------------
> require(rbenchmark)
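> # acc and dt are column names passed as strings, so the columns are
> # looked up with [[ ]] rather than compared as literal strings.
> ave.method = function(df, acc, dt)
>   {df[ df[[dt]] == ave(df[[dt]], df[[acc]], FUN=max), ]}
> merge.method = function(df, acc, dt) {
>   a  <- tapply(df[[dt]], df[[acc]], max)       # per-group maximum
>   a  <- data.frame(ACCOUNT=names(a), DT=a)
>   df <- merge(x=df, y=a, by.x=acc, by.y="ACCOUNT")
>   df <- cbind(df, TAKE=df[[dt]]==df$DT)        # TRUE where row holds the max
>   df[df$TAKE==TRUE,]}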
> benchmark(
>   rep=ave.method(airquality, "Month", "Day"),
>   pat=merge.method(airquality, "Month", "Day"),
>   replications=1000,
>   order=c('replications', 'elapsed'))
> #-----------------
>   test replications elapsed relative user.self sys.self user.child sys.child
> 1  rep         1000   2.523 1.000000     2.512    0.018          0         0
> 2  pat         1000   7.847 3.110186     7.773    0.092          0         0
>
>
> It does give the same answers when tested on airquality, though. That says
> something for it, I suppose. (Had you offered a sensible test dataset in your
> first posting, I would have offered a solution using your column names, but
> as it was I figured you should have been able to make the mappings.)
>
>
> --
> David.
>
>
>
>> Ali
>>
>
>
> David Winsemius, MD
> West Hartford, CT
>
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
