On 6/22/2011 11:02 PM, Idris Raja wrote:
Brian,

I'm a bit confused about how the following line works; specifically, what is
happening in freq=length(x)? Is it just taking the length of x after it has
been summarized by different combinations of x & y? I guess that must be the
case, because that gives the same result as using freq=length(y).

d1<-ddply(d, .(x, y), summarize, freq=length(x))
d2<-ddply(d, .(x, y), summarize, freq=length(y))

Effectively, ddply takes the dataframe (d), splits it up into multiple dataframes based on unique combinations of the variables (x and y), and calls the function (summarize) with each of the sub-dataframes in turn. ddply also has the option to pass additional parameters to the function that is called. In this case, that is what happens with freq=length(x). Each sub-dataframe is the first argument to a call to summarize([sub-dataframe], freq=length(x)).

summarize, in turn, takes a dataframe and other arguments in the form of var=value. It evaluates each value in the context of the dataframe (that is, column names can be used directly as variables) and assigns the result to the variable var. These variables then become the columns of a new dataframe.

> summarize(df, freq=length(x))
  freq
1    9

You are right that length(y) would work just as well; since they are both columns in the same dataframe, they must have the same length.

(The last thing ddply does is take all the dataframes that are returned from the function calls and put them back together into a single dataframe which also includes information on which subset each corresponds to.)
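To make the split/apply/combine steps concrete, here is a rough base-R sketch of the same count done by hand. It is only an illustration of the idea, not how plyr is actually implemented, and the names pieces, counted and manual are made up for the example.

```r
# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# Split: one sub-dataframe per unique (x, y) combination;
# drop = TRUE discards combinations that never occur in the data
pieces <- split(df, list(df$x, df$y), drop = TRUE)

# Apply: build a one-row dataframe per piece, with the row count in freq
counted <- lapply(pieces, function(piece) {
  data.frame(x = piece$x[1], y = piece$y[1], freq = length(piece$x))
})

# Combine: stack the one-row results into a single dataframe
manual <- do.call(rbind, counted)
rownames(manual) <- NULL
manual
```

The row order may differ from what ddply returns, but the counts per (x, y) combination should match.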

Also, what is the significance of the periods before the second argument in
".(x, y)" ?

The variables to split on can be given "as quoted variables, a formula or character vector". The . is a function in plyr that quotes variables (the first option). The following three calls are equivalent:

ddply(df, .(x, y), summarise, freq=length(x))
ddply(df, ~x+y, summarise, freq=length(x))
ddply(df, c("x", "y"), summarise, freq=length(x))
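If you want to convince yourself of that, a quick check along these lines should do it. This assumes plyr is installed; the by_* names are just placeholders I made up.

```r
library(plyr)  # assumes the plyr package is installed

# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# The same grouping expressed three ways
by_quoted    <- ddply(df, .(x, y), summarise, freq = length(x))
by_formula   <- ddply(df, ~ x + y, summarise, freq = length(x))
by_character <- ddply(df, c("x", "y"), summarise, freq = length(x))

# Compare the results pairwise
all.equal(by_quoted, by_formula)
all.equal(by_quoted, by_character)
```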

Thanks for the help.

You may also benefit from reading Hadley's paper on the topic:

Hadley Wickham (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software, 40(1), 1-29. http://www.jstatsoft.org/v40/i01/.

On Tue, Jun 21, 2011 at 12:54 PM, Brian Diggs <dig...@ohsu.edu> wrote:

On 6/21/2011 11:30 AM, Idris Raja wrote:

I have a dataframe df with two columns x and y. I want to count the number
of times a unique x, y combination occurs.

For example

x<- c(1,2,3,4,5,1,2,3,4)
y<- c(1,2,3,4,5,1,2,4,1)

df<-as.data.frame(cbind(x, y))

#what is the correct way to use ddply for this example?
ddply(df, c('x','y'), summarize, ??)

#desired output -- format and order doesn't matter
# (x, y) count
#--------------------
# (1, 1) 2
# (2, 2) 2
# (3, 3) 1
# (4, 4) 1
# (5, 5) 1
# (3, 4) 1
# (4, 1) 1


Jorge and Dennis gave good responses that get you to the result you asked
for, but for completeness I thought I'd include some ddply versions:

ddply(d, .(x, y), summarize, freq=length(x))

This uses the summarize function you were asking about; however, you can
also do it with:

ddply(d, .(x, y), nrow)

or

ddply(d, .(x, y), as.data.frame(nrow))

The latter gives a slightly nicer column name (value instead of V1).
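To see the difference in column names for yourself, something like the following should work. It assumes plyr is installed and uses the df from the original question; as far as I know, the second form relies on plyr providing an as.data.frame method for functions, so it will not work without plyr loaded. The names plain and wrapped are made up for the example.

```r
library(plyr)  # assumes the plyr package is installed

# Data from the original question
x <- c(1, 2, 3, 4, 5, 1, 2, 3, 4)
y <- c(1, 2, 3, 4, 5, 1, 2, 4, 1)
df <- data.frame(x = x, y = y)

# nrow returns a bare number, so ddply has to invent a column name
plain <- ddply(df, .(x, y), nrow)

# Wrapping nrow makes it return a dataframe with a named column
wrapped <- ddply(df, .(x, y), as.data.frame(nrow))

names(plain)
names(wrapped)
```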

As an aside, I prefer using the "summarise" spelling of the function when I
do use it, because it won't clash with Hmisc::summarize.

ddply(d, .(x, y), summarise, freq=length(x))


--
Brian S. Diggs, PhD
Senior Research Associate, Department of Surgery
Oregon Health & Science University


