On Jan 16, 2012, at 4:13 PM, Felipe Nunes wrote:

> Dear all,
> 
> I have a data set in which the same unit appears 2 or 3 or 4 times. I need
> to aggregate this data to maintain only one unit by row. But I need to do
> that based on a comparison between the values of such units. I can't find a
> function to help me on that. I appreciate any help. Below I provide an
> example of what I want:
> 
> This is my data:
> 
> Units  Var1 Var2 Var3
>  1          B       2        2
>  1          C       1       3
>  2          D      3        3
>  2          C      2        1
>  2          A      1        2
>  3          C      2        1
>  3          A      1        3
>  4          B      3        3
>  4          C      2        2
>  4          A      1        1
> 
> I want to produce a new data in which I keep, for each unit, the row with
> max number for Var2. For instance,
> 
> Units  Var1 Var2 Var3
>  1          B       2       2
>  2          D      3        3
>  3          C      2        1
>  4          B      3        3
> 
> Thanks,


There are likely to be multiple approaches, but the first that comes to mind is 
combining ?merge with ?aggregate. 

> DF
   Units Var1 Var2 Var3
1      1    B    2    2
2      1    C    1    3
3      2    D    3    3
4      2    C    2    1
5      2    A    1    2
6      3    C    2    1
7      3    A    1    3
8      4    B    3    3
9      4    C    2    2
10     4    A    1    1

> merge(aggregate(Var2 ~ Units, max, data = DF), DF)
  Units Var2 Var1 Var3
1     1    2    B    2
2     2    3    D    3
3     3    2    C    1
4     4    3    B    3


The use of aggregate() gets you the max value of Var2 for each group within 
Units:

> aggregate(Var2 ~ Units, max, data = DF)
  Units Var2
1     1    2
2     2    3
3     3    2
4     4    3

merge() then allows you to add back the columns from 'DF', matching on 'Units' 
and 'Var2', which are the common columns between the two data frames. Note that 
the ordering of the columns is changed a bit here. That is easily adjusted with:

> merge(aggregate(Var2 ~ Units, max, data = DF), DF)[, c(1, 3, 2, 4)]
  Units Var1 Var2 Var3
1     1    B    2    2
2     2    D    3    3
3     3    C    2    1
4     4    B    3    3


Note that you would want to use the 'na.rm = TRUE' argument to max() in the 
call to aggregate() if in fact any of your Var2 arguments are NA.

HTH,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to