Hi all!  I'm getting a model fit from glm() (a binary logistic regression 
fit, but I don't think that's important) for a formula that contains powers of 
the explanatory variable up to fourth.  So the fit looks something like this 
(typing into mail; the actual fit code is complicated because it involves 
step-down and so forth):

x_sq <- x * x
x_cb <- x * x * x
x_qt <- x * x * x * x
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)

  This works great, and I get a nice fit.  My question regards the confidence 
intervals on that fit.  I am calling:

cis <- confint.default(model, level=0.95)

to get the confidence intervals for each coefficient.  This confint.default() 
call gives me a table like this (where "dispersal" is the real name of what I'm 
calling "x" above):

                             2.5 %      97.5 %
(Intercept)              8.7214297   8.9842533
dispersal              -37.5621412 -35.6629981
dispersal_sq            66.8077669  74.4490123
dispersal_cb           -74.5963861 -62.8347057
dispersal_qt            22.6590159  28.6506186

  A side note: calling confint() instead of confint.default() is much, much 
slower -- my model has many other terms I'm not discussing here, and is fit to 
300,000 data points -- and a check indicates that the intervals returned for my 
model by confint() are not virtually identical, so this is not the source of my 
problem:

                             2.5 %      97.5 %
(Intercept)              8.7216693   8.9844958
dispersal              -37.5643922 -35.6652276
dispersal_sq            66.8121519  74.4534753
dispersal_cb           -74.5995820 -62.8377766
dispersal_qt            22.6592724  28.6509494

  These tables show the problem: the 95% confidence limits for the quartic term 
are every bit as wide as the limits for the other terms.  Since the quartic 
term coefficient gets multiplied by the fourth power of x, this means that the 
width of the confidence band starts out nice and narrow (when x is close to 
zero, where the width of the confidence band is pretty much just determined by 
the confidence limits on the intercept) but balloons out to be tremendously 
wide for larger values of x.  The width of it looks quite unreasonable to me, 
given that my fit is constrained by 300,000 data points.

  Intuitively, I would have expected the confidence limits for the quartic term 
to be much narrower than for, say, the linear term, because the effect of 
variation in the quartic term is so much larger.  But the way confidence 
intervals are calculated in R, at least, does not seem to follow my instincts 
here.  In other words, predicted values using the 2.5% value of the linear 
coefficient and the 97.5% value of the linear coefficient are really not very 
different, but predicted values using the 2.5% value of the quadratic 
coefficient and the 97.5% value of the quadratic coefficient are enormously, 
wildly divergent -- far beyond the range of variability in the original data, 
in fact.  I feel quite confident, in a seat-of-the-pants sense, that the actual 
95% limits of that quartic coefficient are much, much narrower than what R is 
giving me.

  Looking at it another way: if I just exclude the quartic term from the glm() 
fit, I get a dramatically narrower confidence band, even though the fit to the 
data is not nearly as good (I'm keeping the fourth power for good reasons).  
This again seems to me to indicate that the confidence band with the quartic 
term included is artificially, and incorrectly, wide.  If a fit with reasonably 
narrow confidence limits can be produced for the model with terms only up to 
cubic (or, taking this reductio even further, for a quadratic or a linear 
model, also), why does adding the quadratic term, which improves the quality of 
the fit to the data, cause the confidence limits to nevertheless balloon 
outwards?  That seems fundamentally illogical to me, but maybe my instincts are 
bad.

  So what's going on here?  Is this just a limitation of the way confint() 
computes confidence intervals?  Is there a better function to use, that is 
compatible with binomial glm() fits?  Or am I misinterpreting the meaning of 
the confidence limits in some way?  Or what?

  Thanks for any advice.  I've had trouble finding examples of polynomial fits 
like this; people sometimes fit the square of the explanatory variable, of 
course, but going up to fourth powers seems to be quite uncommon, so I've had 
trouble finding out how others have dealt with these sorts of issues that crop 
up only with higher powers.

Ben Haller
McGill University

http://biology.mcgill.ca/grad/ben/
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to