Re: [R] ddply to count frequency of combinations

Brian Diggs Thu, 23 Jun 2011 08:46:07 -0700

On 6/22/2011 11:02 PM, Idris Raja wrote:

Brian,


I'm a bit confused about how the following line works, specifically, what is
happening in freq=length(x)? Is it just taking the length of x after it has
been summarized by different combinations x & y? I guess that must be the
case, because that gives the same result as using freq=length(y)

d1<-ddply(d, .(x, y), summarize, freq=length(x))
d2<-ddply(d, .(x, y), summarize, freq=length(y))

Effectively, ddply takes the dataframe (d), splits it up into multiple dataframes based on unique combinations of the variables (x and y), and calls the function (summarize) with each of the sub-dataframes in turn. ddply also has the option to pass additional parameters to the function that is called. In this case, that is what happens with freq=length(x). Each sub-dataframe is the first argument to a call to summarize([sub-dataframe], freq=length(x)).

summarize, in turn, takes a dataframe and other arguments in the form of var=value. It evaluates each of the values in the context of the dataframe (that is, column names can be used directly as variables) and assigns the result to the variable var. These var's then become the columns of a new dataframe.


> summarize(df, freq=length(x))
  freq
1    9

You are right that length(y) would work just as well; since they are both columns in the same dataframe, they must have the same length.
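To make the evaluation step more concrete, here is a rough base-R sketch of the idea. The helper my_summarise is a made-up name for illustration and is not plyr's actual code; it only mimics "evaluate each value with the dataframe's columns in scope, then collect the results as columns".

```r
# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# Hypothetical stand-in for summarise (an assumption, not plyr's code)
my_summarise <- function(data, ...) {
  caller <- parent.frame()
  # Capture the var=value arguments without evaluating them
  exprs <- as.list(substitute(list(...)))[-1]
  # Evaluate each expression with the columns of `data` visible as variables
  vals <- lapply(exprs, function(e) eval(e, envir = data, enclos = caller))
  # The named results become the columns of a new data frame
  as.data.frame(vals)
}

res_x <- my_summarise(df, freq = length(x))
res_y <- my_summarise(df, freq = length(y))
```

Called on the whole dataframe, both res_x and res_y hold a single freq of 9, since x and y are columns of the same 9-row dataframe.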

(The last thing ddply does is take all the dataframes that are returned from the function calls and put them back together into a single dataframe which also includes information on which subset each corresponds to.)
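If it helps to see the three stages spelled out, the same count can be built by hand in base R. This is only a sketch of the split/apply/combine idea, not how ddply is implemented internally; the variable names pieces, summaries and counts are my own.

```r
# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# Split: one sub-dataframe per (x, y) combination that actually occurs
pieces <- split(df, list(df$x, df$y), drop = TRUE)

# Apply: reduce each piece to one row holding its labels and its row count
summaries <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = piece$y[1], freq = nrow(piece))
})

# Combine: stack the one-row results into a single data frame
counts <- do.call(rbind, summaries)
counts <- counts[order(counts$x, counts$y), ]
rownames(counts) <- NULL
```

For this data, counts has one row for each of the seven combinations that occur, and the freq column sums to 9, the number of rows in df.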

Also, what is the significance of the periods before the second argument in
".(x, y)" ?

The variables to split on can be given "as quoted variables, a formula or character vector". The . is a function in plyr that quotes variables (the first option). The following three are identical:


ddply(df, .(x, y), summarise, freq=length(x))
ddply(df, ~x+y, summarise, freq=length(x))
ddply(df, c("x", "y"), summarise, freq=length(x))
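One way to convince yourself of this is to run all three and compare the results. The sketch below assumes the plyr package is installed and uses the data from your original post; the names by_quoted, by_formula and by_character are just for illustration.

```r
library(plyr)  # assumes plyr is installed

# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# The same split specified three ways
by_quoted    <- ddply(df, .(x, y), summarise, freq = length(x))
by_formula   <- ddply(df, ~ x + y, summarise, freq = length(x))
by_character <- ddply(df, c("x", "y"), summarise, freq = length(x))

# Compare the resulting data frames pairwise
same_12 <- isTRUE(all.equal(by_quoted, by_formula))
same_13 <- isTRUE(all.equal(by_quoted, by_character))
```

I would expect both same_12 and same_13 to come out TRUE, so which form to use is mostly a matter of taste.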

Thanks for the help.


You may also benefit from reading Hadley's paper on the topic:

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.

On Tue, Jun 21, 2011 at 12:54 PM, Brian Diggs <dig...@ohsu.edu> wrote:

On 6/21/2011 11:30 AM, Idris Raja wrote:

I have a dataframe df with two columns x and y. I want to count the number
of times a unique x, y combination occurs.

For example

x<- c(1,2,3,4,5,1,2,3,4)
y<- c(1,2,3,4,5,1,2,4,1)

df<-as.data.frame(cbind(x, y))

#what is the correct way to use ddply for this example?
ddply(df, c('x','y'), summarize, ??)

#desired output -- format and order doesn't matter
# (x, y) count
#--------------------
# (1, 1) 2
# (2, 2) 2
# (3, 3) 1
# (4, 4) 1
# (5, 5) 1
# (3, 4) 1
# (4, 1) 1



Jorge and Dennis gave good responses that get you to the result you asked
for, but for completeness I thought I'd include some ddply versions:

ddply(d, .(x, y), summarize, freq=length(x))

This uses the summarize function you were asking about, however you can
also do it with:

ddply(d, .(x, y), nrow)

or

ddply(d, .(x, y), function(sub) data.frame(value = nrow(sub)))

The latter giving a slightly nicer name (value instead of V1).

As an aside, I prefer using the "summarise" spelling of the function when I
do use it, because it won't clash with Hmisc::summarize.

ddply(d, .(x, y), summarise, freq=length(x))


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


