Re: [R] Using summaryBy with weighted data

Dennis Murphy Mon, 17 Jan 2011 13:12:10 -0800

Hi:

Does this do what you need?


wstats <- function(d) {
             require(Hmisc)
             N <- length(d$response[!is.na(d$response)])
             c(WM = wtd.mean(d$response, d$weights),
               WSE = sqrt(wtd.var(d$response, d$weights)),
               N = N)
    }
library(plyr)
ddply(mydata, .(group), wstats)
  group            WM       WSE  N
1     1  0.1302255752 1.1911298 20
2     2 -0.2814664362 0.8582928 20
3     3 -0.3640550516 1.2618343 20
4     4  0.0002852392 1.1463205 20
5     5 -0.0070283053 1.2315683 20

The trick to writing this function for input into plyr is that the argument
is a data frame. When called in ddply(), the function wstats() will be
applied to each sub-frame corresponding to the grouping factor(s). Inside
it, the variables of interest are extracted relative to the input data frame
and the three quantities are computed. I used wtd.mean() and wtd.var() from
Hmisc, as both will remove NAs by default. In the ddply call, the function
name is simply cited since a sub-data frame is the sole argument of the
function.

I couldn't figure out how to get doBy to get this to work, as it seems best
suited to functions of one argument (a single response), but here's an
alternative using the data.table package:

library(data.table)
# Assumes Hmisc is already loaded...
myDT <- data.table(mydata, key = 'group')
myDT[, list(N = length(response[!is.na(response)]),
             wtdMean = wtd.mean(response, weights),
             wtdSE = sqrt(wtd.var(response, weights))), by = 'group']
     group  N       wtdMean     wtdSE
[1,]     1 20  0.1302255752 1.1911298
[2,]     2 20 -0.2814664362 0.8582928
[3,]     3 20 -0.3640550516 1.2618343
[4,]     4 20  0.0002852392 1.1463205
[5,]     5 20 -0.0070283053 1.2315683

data.table uses a different model of data organization from data frames. A
simplistic description is that it you can think of a data.table as analogous
to a table in a DBMS. Notice that the 'function call' is indexed inside the
data table: the first 'subscript' corresponds to what are called I()
operations  (analogous to 'select' statements in an SQL); the second
'subscript' corresponds to J() operations, (analogous to  'where'
statements), while the third argument is the by group(s), or sub-data
tables, to which (in this case) the J() operations apply.

For functions that take multiple arguments and that are meant to be applied
in a groupwise fashion, I find plyr and data.table to be very good options.
There are also base package alternatives (e.g., some combination of
lapply(), mapply() and do.call()) and several other packages, but plyr and
data.table are generally pretty good at handling most of the niggling
details. Having said that, both have learning curves - data.table, in
particular, will be much easier to pick up if you have some background in
SQLs, since its syntax uses primary principles of SQL.

data.table has a vignette and FAQ, along with an independent help list - for
details, see its page on R-forge:
http://r-forge.r-project.org/projects/datatable/
For plyr's documentation, see
http://had.co.nz/plyr/
A link to its mailing list is found on that page as well.

HTH,
Dennis

On Mon, Jan 17, 2011 at 10:24 AM, Solomon Messing <solomon.mess...@gmail.com
> wrote:

> Thanks Josh.  I built on your example and ended up with the code below--if
> you or anyone sees any issues please let me know.  It would be great if
> there were a slicker way to get these kinds of summary stats in R, but this
> gets the job done.
>
> # takes data frame z with weights w and data x, returns weighted mean,
> weighted SE, and N
> msenw = function(z){
>        N = length(na.omit(z)$response)
>        i = which(!is.na(z$response))
>        return(
>                        c( W.M = weighted.mean(z$response, z$weights,
> na.rm=T),
>                        W.SE = sqrt(wtd.var(z$response, weights =
> z$weights))/sqrt(sum(z$weights[i])),
>                        N=N ) )
> }
>
> library(doBy)
> library(Hmisc)
> ## make up some data (easier)
> mydata <- data.frame(response = rnorm(100),
>                group = rep(1:5, each = 20), weights = runif(100, 0, 1))
>
> xy <- by(mydata, mydata$group, msenw)
> data.frame( group = names(c(xy)), do.call(rbind, xy) )
>
> ## can be extended to other data using:
> xy <- by(data.frame(response = mydata$response, weights = mydata$weights),
> mydata$group, msenw)
>
>
> Solomon Messing
> www.stanford.edu/~messing <http://www.stanford.edu/%7Emessing>
>
>
>
> On Jan 16, 2011, at 11:16 PM, Joshua Wiley wrote:
>
> > Dear Solomon,
> >
> > On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
> > <solomon.mess...@gmail.com> wrote:
> >> Dear Soren and R users:
> >>
> >> I am trying to use the summaryBy function with weights.  Is this
> possible?  An example that illustrates what I am trying to do follows:
> >>
> >> library(doBy)
> >> ## make up some data
> >> response = rnorm(100)
> >> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> >> weights = runif(100, 0, 1)
> >> mydata = data.frame(response,group,weights)
> >>
> >> ## run summaryBy without weights:
> >> summaryBy(response~group, data = mydata, FUN = mean)
> >>
> >> ## attempt to run summaryBy with weights, throws error
> >> summaryBy(x~group, data = mydata, FUN = weighted.mean, w=weights )
> >>
> >> ## throws the error:
> >> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) {
> :
> >> #                                       arguments must have same length
> >>
> >> My guess is that summaryBy is not giving weighted.mean() each group of
> weights, but instead is passing all of the weights in the data set each time
> it calls weighted.mean().
> >
> > Yes, of course.  It has no way of knowing that the weights should also
> > be being broken down by group....they are not in the formula.
> >
> >>  Do you know if there is some way to get summaryBy to pass weights to
> weighted.mean() only for each group?
> >
> > Ideally there would be a way to pass more than one variable to a
> > function (e.g., response and weights) or just an entire object
> > (mydata) broken down by group.  Then you would just make a wrapper
> > function to pass the right values to the x and w arguments of
> > weighted.mean.  Instead here is a somewhat hacked version:
> >
> > library(doBy)
> > ## make up some data (easier)
> > mydata <- data.frame(response = rnorm(100),
> > group = rep(1:5, each = 20), weights = runif(100, 0, 1))
> >
> > ## manually compute weighted mean
> > tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
> > tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
> > tmp ## weighted means
> >
> > ## here's the 'problem', if you will, even with  +, they are passed
> > one at a time
> > summaryBy(response + weights ~ group, data = mydata, FUN = str)
> > summaryBy(mydata ~ group, data = mydata, FUN = str)
> >
> > ## here is an option using by():
> > xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response,
> z$weights))
> > xy
> > ## if you don't like the formatting....
> > data.frame(group = names(c(xy)), weighted.mean = c(xy))
> >
> > HTH,
> >
> > Josh
> >
> >>
> >> I suspect this functionality would be a tremendous benefit to R users
> who regularly work with weighted data, such as myself.
> >>
> >> Thanks,
> >>
> >> Solomon Messing
> >> www.stanford.edu/~messing <http://www.stanford.edu/%7Emessing>
> >>
> >> PS I know this basic example can be done using lapply(split(...))
> approach referenced here:
> >>
> >> http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg12349.html
> >>
> >> but for more complex tasks the lapply approach will mean writing a lot
> of extra code to run everything and then to get things formatted as nicely
> as summaryBy() was designed to do.
> >>
> >>
> >>        [[alternative HTML version deleted]]
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> >
> > --
> > Joshua Wiley
> > Ph.D. Student, Health Psychology
> > University of California, Los Angeles
> > http://www.joshuawiley.com/
>
>
>        [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Using summaryBy with weighted data

Reply via email to