Re: [R] hwo to speed up "aggregate"

Dennis Murphy Wed, 26 Jan 2011 04:15:18 -0800

If you have a large data frame, one option is package data.table. Try the
following:


library(data.table)
dt <- data.table(df)
dt[, list(qsum = sum(quantity)), by ='client, date, name']
     client       date  name qsum
[1,]      1 2010-01-01   one   45
[2,]      1 2011-01-01   one 4500
[3,]      2 2010-01-01   two   40
[4,]      2 2011-01-01   two 4000
[5,]      3 2010-01-01 three   20
[6,]      3 2011-01-01 three 2000

BTW, the leading comma after the opening bracket is not a typo :)

For R versions >= 2.11.x, aggregate() has a formula interface that saves a
fair bit of typing:

aggregate(quantity ~ client + date + name, data = df, FUN = sum)
  client       date  name quantity
1      1 2010-01-01   one       45
2      1 2011-01-01   one     4500
3      3 2010-01-01 three       20
4      3 2011-01-01 three     2000
5      2 2010-01-01   two       40
6      2 2011-01-01   two     4000

A third option is package plyr and function ddply():

library(plyr)
ddply(df, .(client, date, name), summarise, qsum = sum(quantity))
# same output as data.table

Hopefully one or more of these will improve your processing time.

Dennis

On Wed, Jan 26, 2011 at 2:39 AM, analys...@hotmail.com <
analys...@hotmail.com> wrote:

> I have
>
> > df
>   quantity branch client       date  name
> 1        10      1      1 2010-01-01   one
> 2        20      2      1 2010-01-01   one
> 3        30      3      2 2010-01-01   two
> 4        15      4      1 2010-01-01   one
> 5        10      5      2 2010-01-01   two
> 6        20      6      3 2010-01-01 three
> 7      1000      1      1 2011-01-01   one
> 8      2000      2      1 2011-01-01   one
> 9      3000      3      2 2011-01-01   two
> 10     1500      4      1 2011-01-01   one
> 11     1000      5      2 2011-01-01   two
> 12     2000      6      3 2011-01-01 three
>
> I want to aggregate away the branch. I followed a suggestion by Gabor
> (thanks) and did
>
> >
> aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date),sum)
>  client       date quantity
> 1      1 2010-01-01       45
> 2      2 2010-01-01       40
> 3      3 2010-01-01       20
> 4      1 2011-01-01     4500
> 5      2 2011-01-01     4000
> 6      3 2011-01-01     2000
>
> I want df$name also in the output and did what looked obvious:
>
> >
> aggregate(list(quantity=df$quantity),list(client=df$client,date=df$date,name=df$name),sum)
>  client       date  name quantity
> 1      1 2010-01-01   one       45
> 2      1 2011-01-01   one     4500
> 3      3 2010-01-01 three       20
> 4      3 2011-01-01 three     2000
> 5      2 2010-01-01   two       40
> 6      2 2011-01-01   two     4000
>
> It seems to work, but slows down tremendously for a dataframe with
> around a 1000 rows.
>
> Could anyone explain what is going on and suggest a way out?
>
> Thanks.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] hwo to speed up "aggregate"

Reply via email to