Re: [R] aggregate function - na.action

Gene Leynes Sun, 06 Feb 2011 14:00:48 -0800

By the way, thanks for sending that formula, it's quite thoughtful of you to
send an answer with an actual working line of code!


When I experimented with ddply earlier last week I couldn't figure out the
syntax for a single line aggregation, so it's good to have this example. I
will likely use it for other things.
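
For anyone finding this thread in the archive, here is a minimal sketch of a one-line ddply aggregation on a small made-up data frame. The data frame `toy` and its columns `g1`, `g2` and `y` are hypothetical (they are not the `dat` from the thread), and the `summarise` form is just an alternative to the anonymous function in Ista's reply quoted below. It assumes plyr is installed.

```r
library(plyr)  # assumed to be installed

# Hypothetical toy data: NAs in the grouping columns (g1, g2) and in y
toy <- data.frame(
  g1 = c("a", "a", NA, "b", NA),
  g2 = c(1, 1, 2, NA, 2),
  y  = c(1, 2, 4, NA, 8)
)

# One-line aggregation: one row per (g1, g2) combination,
# with the NA combinations kept as groups of their own
by_group <- ddply(toy, .(g1, g2), summarise, y.sum = sum(y, na.rm = TRUE))
by_group

# The grand total of the grouped sums should agree with the plain total
all.equal(sum(by_group$y.sum), sum(toy$y, na.rm = TRUE))
```

Because the NA groups are kept, nothing from `y` is lost on the way to the grand total, which is the behaviour I was after.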

On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <[email protected]> wrote:

> oops. For clarity, that should have been
>
> sum(ddply(dat, .(x1,x2,x3,x4), function(x){data.frame(y.sum=sum(x$y,
> na.rm=TRUE))})$y.sum)
>
> -Ista
>
> On Fri, Feb 4, 2011 at 7:52 PM, Ista Zahn <[email protected]>
> wrote:
> > Hi again,
> >
> > On Fri, Feb 4, 2011 at 7:18 PM, Gene Leynes <[email protected]> wrote:
> >> Ista,
> >>
> >> Thank you again.
> >>
> >> I had figured that out... and was crafting another message when you
> >> replied.
> >>
> >> The NAs do come through on the variable that is being aggregated.
> >> However, they do not come through on the categorical variable(s).
> >>
> >> The aggregate function must be converting the data frame variables to
> >> factors, with the default "omit=NA" parameter.
> >>
> >> The help on "aggregate" says:
> >> na.action     A function which indicates what should happen when the data
> >>               contain NA values.
> >>               The default is to ignore missing values in the given
> >>               variables.
> >> By "data" it must only refer to the aggregated variable, and not the
> >> categorical variables.  I thought it referred to both, because I thought it
> >> referred to the "data" argument, which is the underlying data frame.
> >>
> >> I think the proper way to accomplish this would be to recast my x
> >> (categorical) variables as factors.
> >
> > Yes, that would work.
> >
> >> This is not feasible for me due to
> >> other complications.
> >> Also, (imho) the help should be more clear about what the na.action
> >> modifies.
> >>
> >> So, unless someone has a better idea, I guess I'm out of luck?
> >
> > Well, you can use ddply from the plyr package:
> >
> > library(plyr) # may need to install first.
> > sum(ddply(dat, .(x1,x2,x3,x4), function(x){data.frame(y.sum=sum(x$y,
> > na.rm=TRUE))})$y)
> >
> > However, I don't think you've told us what you're actually trying to
> > accomplish...
> >
> > Best,
> > Ista
> >
> >>
> >>
> >> On Fri, Feb 4, 2011 at 6:05 PM, Ista Zahn <[email protected]>
> >> wrote:
> >>>
> >>> Hi,
> >>>
> >>> On Fri, Feb 4, 2011 at 6:33 PM, Gene Leynes <[email protected]>
> >>> wrote:
> >>> > Thank you both for the thoughtful (and funny) replies.
> >>> >
> >>> > I agree with both of you that sum is the one picking up na.rm, not
> >>> > aggregate. Although I didn't mention it, I did realize that in the
> >>> > first place.
> >>> > Also, thank you Phil for pointing out that aggregate only accepts a
> >>> > formula value in more recent versions!  I actually thought that was an
> >>> > older feature, but I must be thinking of other functions.
> >>> >
> >>> > I still don't see why these two values are not the same!
> >>> >
> >>> > It seems like a bug to me
> >>>
> >>> No, not a bug (see below).
> >>>
> >>> >
> >>> >> set.seed(100)
> >>> >> dat=data.frame(
> >>> > +         x1=sample(c(NA,'m','f'), 100, replace=TRUE),
> >>> > +         x2=sample(c(NA, 1:10), 100, replace=TRUE),
> >>> > +         x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
> >>> > +         x4=sample(c(NA,T,F), 100, replace=TRUE),
> >>> > +         y=sample(c(rep(NA,5), rnorm(95))))
> >>> >> sum(dat$y, na.rm=T)
> >>> > [1] 0.0815244116598
> >>> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass,
> >>> >> na.rm=T)$y)
> >>> > [1] -4.45087666247
> >>> >>
> >>>
> >>> Because in the first one you are only removing missing data in dat$y.
> >>> In the second one you are removing all rows that contain missing data
> >>> in any of the columns.
> >>>
> >>> all.equal(sum(na.omit(dat)$y), sum(aggregate(y~x1+x2+x3+x4, data=dat,
> >>> sum, na.action=na.pass, na.rm=T)$y))
> >>> [1] TRUE
> >>>
> >>> Best,
> >>> Ista
> >>>
> >>> >
> >>> >
> >>> >
> >>> > On Fri, Feb 4, 2011 at 4:18 PM, Ista Zahn <[email protected]>
> >>> > wrote:
> >>> >>
> >>> >> Sorry, I didn't see Phil's reply, which is better than mine anyway.
> >>> >>
> >>> >> -Ista
> >>> >>
> >>> >> On Fri, Feb 4, 2011 at 5:16 PM, Ista Zahn <[email protected]>
> >>> >> wrote:
> >>> >> > Hi,
> >>> >> >
> >>> >> > Please see ?na.action
> >>> >> >
> >>> >> > (just kidding!)
> >>> >> >
> >>> >> > So it seems to me the problem is that you are passing na.rm to the
> >>> >> > sum function. So there is no missing data for the na.action
> >>> >> > argument to operate on!
> >>> >> >
> >>> >> > Compare
> >>> >> >
> >>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.fail)$y)
> >>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.pass)$y)
> >>> >> > sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.action=na.omit)$y)
> >>> >> >
> >>> >> >
> >>> >> > Best,
> >>> >> > Ista
> >>> >> >
> >>> >> > On Fri, Feb 4, 2011 at 4:07 PM, Gene Leynes <[email protected]>
> >>> >> > wrote:
> >>> >> >> Can someone please tell me what is up with na.action in aggregate?
> >>> >> >>
> >>> >> >> My (somewhat) reproducible example:
> >>> >> >> (I say somewhat because some lines wouldn't run in a separate
> >>> >> >> session, more below)
> >>> >> >>
> >>> >> >> set.seed(100)
> >>> >> >> dat=data.frame(
> >>> >> >>        x1=sample(c(NA,'m','f'), 100, replace=TRUE),
> >>> >> >>        x2=sample(c(NA, 1:10), 100, replace=TRUE),
> >>> >> >>        x3=sample(c(NA,letters[1:5]), 100, replace=TRUE),
> >>> >> >>        x4=sample(c(NA,T,F), 100, replace=TRUE),
> >>> >> >>        y=sample(c(rep(NA,5), rnorm(95))))
> >>> >> >> dat
> >>> >> >> ## The total from dat:
> >>> >> >> sum(dat$y, na.rm=T)
> >>> >> >> ## The total from aggregate:
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
> >>> >> >> ## ^--- This line gave an error in a separate R instance
> >>> >> >> ## The aggregate formula is excluding NA
> >>> >> >>
> >>> >> >> ## So, let's try to include NAs
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
> >>> >> >> na.action='na.pass')$y)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
> >>> >> >> na.action=na.pass)$y)
> >>> >> >> ## The aggregate formula is STILL excluding NA
> >>> >> >> ## In fact, the formula doesn't seem to notice the na.action
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
> >>> >> >> na.action='foo man chew')$y)
> >>> >> >> ## Hmmmm... that error surprised me (since the previous two things
> >>> >> >> ## ran)
> >>> >> >>
> >>> >> >> ## So, let's try to change the global options
> >>> >> >> ## (not mentioned in the help, but after reading the help
> >>> >> >> ##  100 times, I thought I would go above and beyond to avoid
> >>> >> >> ##  any r list flames from people complaining
> >>> >> >> ##  that I didn't read the help... but that's a separate topic)
> >>> >> >> options(na.action ="na.pass")
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
> >>> >> >> na.action='na.pass')$y)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T,
> >>> >> >> na.action=na.pass)$y)
> >>> >> >> ## (NAs are still omitted)
> >>> >> >>
> >>> >> >> ## Even more frustrating...
> >>> >> >> ## Why don't any of these work???
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action='na.pass')$x)
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.pass)$x)
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action='na.omit')$x)
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T, na.action=na.omit)$x)
> >>> >> >>
> >>> >> >>
> >>> >> >> ## This does work, but in my real data set, I want NA to really
> >>> >> >> ## be NA
> >>> >> >> for(j in 1:4)
> >>> >> >>    dat[is.na(dat[,j]),j] = 'NA'
> >>> >> >> sum(aggregate(dat$y, dat[,1:4], sum, na.rm=T)$x)
> >>> >> >> sum(aggregate(y~x1+x2+x3+x4, data=dat, sum, na.rm=T)$y)
> >>> >> >>
> >>> >> >>
> >>> >> >> ## My first session info
> >>> >> >> #
> >>> >> >> #> sessionInfo()
> >>> >> >> #R version 2.12.0 (2010-10-15)
> >>> >> >> #Platform: i386-pc-mingw32/i386 (32-bit)
> >>> >> >> #
> >>> >> >> #locale:
> >>> >> >> #        [1] LC_COLLATE=English_United States.1252
> >>> >> >> #[2] LC_CTYPE=English_United States.1252
> >>> >> >> #[3] LC_MONETARY=English_United States.1252
> >>> >> >> #[4] LC_NUMERIC=C
> >>> >> >> #[5] LC_TIME=English_United States.1252
> >>> >> >> #
> >>> >> >> #attached base packages:
> >>> >> >> #        [1] stats     graphics  grDevices utils     datasets  methods   base
> >>> >> >> #
> >>> >> >> #other attached packages:
> >>> >> >> #        [1] plyr_1.2.1  zoo_1.6-4   gdata_2.8.1 rj_0.5.0-5
> >>> >> >> #
> >>> >> >> #loaded via a namespace (and not attached):
> >>> >> >> #        [1] grid_2.12.0     gtools_2.6.2    lattice_0.19-13 rJava_0.8-8
> >>> >> >> #[5] tools_2.12.0
> >>> >> >>
> >>> >> >>
> >>> >> >>
> >>> >> >> I tried running that example in a different version of R, and I got
> >>> >> >> completely different results.
> >>> >> >>
> >>> >> >> The other version of R wouldn't recognize the formula at all.
> >>> >> >>
> >>> >> >> My other version of R:
> >>> >> >>
> >>> >> >> #  My second session info
> >>> >> >> #> sessionInfo()
> >>> >> >> #R version 2.10.1 (2009-12-14)
> >>> >> >> #i386-pc-mingw32
> >>> >> >> #
> >>> >> >> #locale:
> >>> >> >> #        [1] LC_COLLATE=English_United States.1252
> >>> >> >> #[2] LC_CTYPE=English_United States.1252
> >>> >> >> #[3] LC_MONETARY=English_United States.1252
> >>> >> >> #[4] LC_NUMERIC=C
> >>> >> >> #[5] LC_TIME=English_United States.1252
> >>> >> >> #
> >>> >> >> #attached base packages:
> >>> >> >> #        [1] stats     graphics  grDevices utils     datasets  methods   base
> >>> >> >> #>
> >>> >> >> #
> >>> >> >>
> >>> >> >> PS: Also, I have read the help on aggregate, factor, as.factor, and
> >>> >> >> several other topics.  If I missed something, please let me know.
> >>> >> >> Some people like to reply to questions by telling the sender that R
> >>> >> >> has documentation.  Please don't.  The R help archives are littered
> >>> >> >> with reminders, friendly and otherwise, of R's documentation.
> >>> >> >>
> >>> >> >>        [[alternative HTML version deleted]]
> >>> >> >>
> >>> >> >> ______________________________________________
> >>> >> >> [email protected] mailing list
> >>> >> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> >> >> PLEASE do read the posting guide
> >>> >> >> http://www.R-project.org/posting-guide.html
> >>> >> >> and provide commented, minimal, self-contained, reproducible
> code.
> >>> >> >>
> >>> >> >
> >>> >> >
> >>> >> >
> >>> >> > --
> >>> >> > Ista Zahn
> >>> >> > Graduate student
> >>> >> > University of Rochester
> >>> >> > Department of Clinical and Social Psychology
> >>> >> > http://yourpsyche.org
> >>> >> >
> >>> >>
> >>> >
> >>> >
> >>>
> >>
> >>
> >
> >
>
>
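
To recap the point of the thread in runnable form, here is a base-R sketch under stated assumptions. The data frame `toy` and all variable names are hypothetical, and using `addNA()` to turn NA into an explicit factor level is one possible reading of the "recast as factors" idea discussed above, not something that was actually posted. It contrasts `na.rm` (handed on to `sum`) with `na.action` (applied to the rows of the formula's model frame), and shows that rows with an NA grouping value are left out either way unless NA is made a level.

```r
# Hypothetical toy data: NAs in the grouping columns (g1, g2) and in y
toy <- data.frame(
  g1 = c("a", "a", NA, "b", NA),
  g2 = c(1, 1, 2, NA, 2),
  y  = c(1, 2, 4, NA, 8)
)

# Plain total, ignoring missing y values
total_all <- sum(toy$y, na.rm = TRUE)

# Formula interface, default na.action: rows with an NA in any variable
# of the formula are dropped before sum() is called
total_default <- sum(aggregate(y ~ g1 + g2, data = toy, FUN = sum,
                               na.rm = TRUE)$y)

# na.pass keeps those rows in the model frame, but rows whose grouping
# values are NA are still left out of the aggregated result
total_pass <- sum(aggregate(y ~ g1 + g2, data = toy, FUN = sum,
                            na.rm = TRUE, na.action = na.pass)$y)

# Making NA an explicit factor level turns it into an ordinary group
groups <- lapply(toy[c("g1", "g2")], addNA)
agg_na <- aggregate(toy["y"], by = groups, FUN = sum, na.rm = TRUE)
total_addna <- sum(agg_na$y)

c(all = total_all, default = total_default,
  pass = total_pass, addNA = total_addna)
```

Note that `addNA()` returns factors, so the grouping columns in `agg_na` come back as factors with NA as one of the levels. That may or may not be acceptable, depending on what happens to the result downstream.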


