Re: [R] Comparison of aggregate in R and group by in mysql

zhihuali Sat, 26 Jan 2008 18:15:07 -0800

Thanks, Jim.

My system has 1 GB of RAM. I guess the computed matrix is close to the memory 
limit, and that's what caused the problem.  I think I'll take Wensui's 
suggestion and use a relational database system to handle the huge data.
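
For reference, here is a rough sketch of the memory arithmetic quoted below, 
together with one possible way to aggregate only over the group/type pairs 
that actually occur in the data. It is not a tested fix for the real data 
set: the vectors group, type and signal are simulated stand-ins (the real 
ones are not shown in this thread), and the names key, means and parts are 
made up for the illustration.

```r
# Sketch only: 'group', 'type' and 'signal' are simulated stand-ins for the
# real vectors, which are not shown in this thread.
set.seed(42)
n <- 20000
group  <- factor(sample(sprintf("g%04d", 1:700), n, replace = TRUE))
type   <- factor(sample(sprintf("t%05d", 1:1100), n, replace = TRUE))
signal <- runif(n)

# Size of the full group x type grid for the real data (7049 x 11704 levels),
# held as one double-precision matrix at 8 bytes per cell.
grid_cells <- 7049 * 11704           # 82,501,496 cells
grid_mb    <- grid_cells * 8 / 1e6   # about 660 MB

# Build one key per row from the combinations that actually occur, so the
# result has one entry per observed pair instead of one per grid cell.
key   <- paste(group, type, sep = "\t")
means <- tapply(signal, key, mean)

# Split each key back into its two parts for a tidy result.
parts <- do.call(rbind, strsplit(names(means), "\t", fixed = TRUE))
y.agg <- data.frame(group = parts[, 1], type = parts[, 2],
                    mean = as.vector(means), stringsAsFactors = FALSE)
```

With the simulated data, y.agg has at most n rows, however many levels the 
two factors have. A GROUP BY query in a database likewise returns only the 
combinations present in the table.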




> Date: Sat, 26 Jan 2008 20:40:51 -0500
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> CC: [EMAIL PROTECTED]
> 
> I think with your data you will be computing a matrix that is 7049 x
> 11704.  This will require about 700MB of memory.  How much memory does
> your system have?  How big is the data frame? (Run 'str' on it and
> report what it says.)  This will require a lot more resources, and
> given that you have about 80M possible combinations, I would assume
> that many of them are empty.  aggregate has to 'split' the
> data into the groups and then summarize.  Maybe you should use a
> database for data with this many combinations.
> 
> 2008/1/26 zhihuali <[EMAIL PROTECTED]>:
> >
> > I repeated your experiment:
> > > n <- 1000000
> > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),
> > +     C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> >   user  system elapsed
> >  1.824   0.212   2.038
> >
> >
> > Now I use my own data:
> > > length(levels(group))
> > [1] 7049
> > > length(levels(type))
> > [1] 11704
> > > y<-data.frame(group,type,signal)
> > > system.time(y.agg <- aggregate(y$signal, list(y$group,y$type), mean))
> >   (I killed it after 30 minutes)
> >
> >
> >
> > > Date: Sat, 26 Jan 2008 19:55:51 -0500
> > > From: [EMAIL PROTECTED]
> > > To: [EMAIL PROTECTED]
> > > Subject: Re: [R] Comparison of aggregate in R and group by in mysql
> > > CC: [EMAIL PROTECTED]
> >
> > >
> > > How large is your dataframe?  How much memory do you have on your
> > > system?  Are you paging?  Here is a test I ran with a data frame with
> > > 1,000,000 entries and it seems to be fast:
> > >
> > > > n <- 1000000
> > > > x <- data.frame(A=sample(LETTERS,n,TRUE), B=sample(letters[1:4],n,TRUE),
> > > +     C=sample(LETTERS[1:4], n, TRUE), data=runif(n))
> > > > system.time(x.agg <- aggregate(x$data, list(x$A, x$B, x$C), mean))
> > >    user  system elapsed
> > >    2.65    0.34    3.00
> > > >
> > >
> > > On Jan 26, 2008 6:45 PM, zhihuali <[EMAIL PROTECTED]> wrote:
> > > >
> > > > Hi, netters,
> > > >
> > > > First of all, thanks a lot for all the prompt replies to my earlier 
> > > > question about "merging" data frames in R.
> > > > Actually, that is equivalent to the "join" clause in MySQL.
> > > >
> > > > Now I have another question. Suppose I have a data frame X with many 
> > > > columns/variables:
> > > > Name, Age, Group, Type, Salary.
> > > > I want to compute the mean salary for each subgroup:
> > > > aggregate(X$Salary, by=list(X$Group, X$Age, X$Type), FUN=mean)
> > > >
> > > > When Group and Type have a huge number of levels, R takes forever to 
> > > > finish the aggregation.
> > > > gc() also showed that the memory usage was large.
> > > >
> > > > However, in MySQL, a similar job finished in seconds:
> > > > SELECT `Group`, Age, Type, AVG(Salary) FROM X GROUP BY `Group`, Age, Type;
> > > >
> > > > Is it because MySQL is better suited to this kind of task? Or is my R 
> > > > command not efficient enough? Why does R need so much memory to do 
> > > > the aggregation?
> > > >
> > > > Thanks again!
> > > >
> > > > Zhihua Li
> > > >
> > > >
> > > >
> > > > ______________________________________________
> > > > R-help@r-project.org mailing list
> > > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > > PLEASE do read the posting guide 
> > > > http://www.R-project.org/posting-guide.html
> > > > and provide commented, minimal, self-contained, reproducible code.
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > Jim Holtman
> > > Cincinnati, OH
> > > +1 513 646 9390
> > >
> > > What is the problem you are trying to solve?
> >
> 
> 
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem you are trying to solve?
