Hi: There are several ways to do this; I'll offer one from the plyr package. See inline.
On Mon, Jul 4, 2011 at 9:57 PM, Mateus Rabello <mateus_rabe...@hotmail.com> wrote: > Hi, suppose that I have the following data.frame: > > cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y > 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 > 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 > 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 > 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 > > > I would like to create a new variable X that indicates which line, within the > cnpj variable, has the highest value Y. For instance, within the cnpj = > 10020470, the second line has the largest value Y (891). For cnpj = 10023333 > is trivial (1110). Then, my new data.frame would become: > > cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X > 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE > 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE > 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE > 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE > > > Notice that for every value of the variable cnpj, only one line will have X = > TRUE. > > Then, I would like to create a variable Z that is the sum of variable Y, also > by variable cnpj. Thus, if cnpj = 10020470, Z = 183 + 891 +529 and for cnpj = > 10023333, Z = 120. These sums can easily be done with tapply or aggregate but > those would eliminate line with equal cnpj and I don’t want that. I would > like to achieve a data.frame like the following: > > cnae4 cnpj 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 Y X Z > 24996 10020470 1 1 2 12 16 21 17 51 43 19 183 FALSE 1603 > 24996 10020470 69 91 79 92 91 77 90 96 98 108 891 TRUE 1603 > 36145 10020470 0 0 0 0 2 83 112 97 91 144 529 FALSE 1603 > 44444 10023333 5 20 60 0 0 0 0 5 20 1000 1110 TRUE 1110 To get the above structure, then assuming this data frame is named df, one way is to use the ddply() function in the plyr package with an external function to do the work for a generic subset with constant cnpj: library(plyr) myfun <- function(d) { d$X <- d$Y == max(d$Y) d$Z <- sum(d$Y) d } ddply(df, .(cnpj), myfun) > > > In the end I will eliminate all lines with X = FALSE. To do this, all you need is to rewrite the function slightly: myfun2 <- function(d) { d$Z <- sum(d$Y) d[which.max(d$Y), ] } ddply(df, .(cnpj), myfun) HTH, Dennis > > > Thank you and sorry for the long question. > > Mateus Rabello > [[alternative HTML version deleted]] > > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.