On May 6, 2011, at 11:35 AM, Ben Haller wrote:
Hi all! I'm getting a model fit from glm() (a binary logistic
regression fit, but I don't think that's important) for a formula
that contains powers of the explanatory variable up to the fourth. So
the fit looks something like this (typing into mail; the actual fit
code is complicated because it involves step-down and so forth):
x_sq <- x * x
x_cb <- x * x * x
x_qt <- x * x * x * x
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
It might have been easier with:
model <- glm(outcome ~ poly(x, 4) , binomial)
Since the authors of confint() were probably expecting a poly()
formulation, the results might be more reliable with it. I'm just using
lm(), but I think the conclusions are more general:
> set.seed(123)
> x <- seq(0, 4, by=0.1)
> y <- seq(0, 4, by=0.1)^2 + rnorm(41)
> x2 <- x^2
> x3 <- x^3
> summary(lm(y ~ x + x2 + x3))
Call:
lm(formula = y ~ x + x2 + x3)
Residuals:
    Min      1Q  Median      3Q     Max
-1.8528 -0.6146 -0.1164  0.6211  1.8923

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.45737    0.52499   0.871   0.3893
x           -0.75989    1.15080  -0.660   0.5131
x2           1.30987    0.67330   1.945   0.0594 .
x3          -0.03559    0.11058  -0.322   0.7494
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
From that output I wouldn't have been able to understand why a highly
significant overall relationship could not be ascribed to any of the
terms. See what happens with poly(x, 3):
> summary(lm(y ~ poly(x, 3)))
Call:
lm(formula = y ~ poly(x, 3))
Residuals:
    Min      1Q  Median      3Q     Max
-1.8528 -0.6146 -0.1164  0.6211  1.8923

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.4271     0.1434  37.839  < 2e-16 ***
poly(x, 3)1  30.0235     0.9184  32.692  < 2e-16 ***
poly(x, 3)2   8.7823     0.9184   9.563 1.53e-11 ***
poly(x, 3)3  -0.2956     0.9184  -0.322    0.749
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
> confint(lm(y ~ poly(x, 3)))
                2.5 %    97.5 %
(Intercept)  5.136526  5.717749
poly(x, 3)1 28.162706 31.884351
poly(x, 3)2  6.921505 10.643149
poly(x, 3)3 -2.156431  1.565213
> confint(lm(y ~ x + x2 + x3))
                   2.5 %    97.5 %
(Intercept) -0.60636547 1.5211072
x           -3.09164639 1.5718576
x2          -0.05437802 2.6741120
x3          -0.25964705 0.1884609
Neither version exactly captures the simulation input, which would
have shown only an effect of the quadratic, but I didn't do any
centering. At least the poly version shows the lower order terms up to
quadratic as "significant", whereas it's difficult to extract much
meaningful out of the separately calculated term version.
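To make the collinearity concrete, here is a small sketch (same simulated x as above): the raw powers of x on [0, 4] are highly correlated with one another, while the columns produced by poly() are orthogonal by construction, which is why the individual t-tests behave so differently between the two parameterizations.

```r
# Raw powers of x over [0, 4] are strongly correlated with one another,
# while poly() builds an orthogonal basis for the same column space.
x <- seq(0, 4, by = 0.1)

X_raw  <- cbind(x1 = x, x2 = x^2, x3 = x^3)
X_poly <- poly(x, 3)

round(cor(X_raw), 3)    # off-diagonal correlations all above 0.9
round(cor(X_poly), 3)   # essentially the identity matrix
```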
--
David.
This works great, and I get a nice fit. My question regards the
confidence intervals on that fit. I am calling:
cis <- confint.default(model, level=0.95)
to get the confidence intervals for each coefficient. This
confint.default() call gives me a table like this (where "dispersal"
is the real name of what I'm calling "x" above):
                   2.5 %      97.5 %
(Intercept)    8.7214297   8.9842533
dispersal    -37.5621412 -35.6629981
dispersal_sq  66.8077669  74.4490123
dispersal_cb -74.5963861 -62.8347057
dispersal_qt  22.6590159  28.6506186
A side note: calling confint() instead of confint.default() is
much, much slower -- my model has many other terms I'm not
discussing here, and is fit to 300,000 data points -- but a check
indicates that the intervals returned for my model by confint() are
virtually identical, so this is not the source of my problem:
                   2.5 %      97.5 %
(Intercept)    8.7216693   8.9844958
dispersal    -37.5643922 -35.6652276
dispersal_sq  66.8121519  74.4534753
dispersal_cb -74.5995820 -62.8377766
dispersal_qt  22.6592724  28.6509494
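[For what it's worth, the two functions really are computing different things: confint.default() uses the Wald approximation (estimate +/- z * standard error), while confint() on a glm profiles the likelihood (via MASS), which is why it is so much slower. With 300,000 observations the two will normally agree closely. A small sketch on simulated data:

```r
# Wald intervals (confint.default) versus profile-likelihood intervals
# (confint) for a small simulated logistic regression.
set.seed(1)
x <- runif(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

confint.default(fit)  # estimate +/- 1.96 * std. error
confint(fit)          # profile likelihood; close, but not identical
```
]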
These tables show the problem: the 95% confidence limits for the
quartic term are every bit as wide as the limits for the other
terms. Since the quartic term coefficient gets multiplied by the
fourth power of x, this means that the width of the confidence band
starts out nice and narrow (when x is close to zero, where the width
of the confidence band is pretty much just determined by the
confidence limits on the intercept) but balloons out to be
tremendously wide for larger values of x. The width of it looks
quite unreasonable to me, given that my fit is constrained by
300,000 data points.
Intuitively, I would have expected the confidence limits for the
quartic term to be much narrower than for, say, the linear term,
because the effect of variation in the quartic term is so much
larger. But the way confidence intervals are calculated in R, at
least, does not seem to follow my instincts here. In other words,
predicted values using the 2.5% value of the linear coefficient and
the 97.5% value of the linear coefficient are really not very
different, but predicted values using the 2.5% value of the
quadratic coefficient and the 97.5% value of the quadratic
coefficient are enormously, wildly divergent -- far beyond the range
of variability in the original data, in fact. I feel quite
confident, in a seat-of-the-pants sense, that the actual 95% limits
of that quartic coefficient are much, much narrower than what R is
giving me.
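[That seat-of-the-pants instinct can be checked directly: in the raw-power parameterization the coefficient estimates are strongly correlated with one another, so the marginal intervals cannot be combined endpoint-by-endpoint -- setting every coefficient at its 2.5% or 97.5% limit simultaneously lands far outside the joint 95% region. A sketch on the simulated data from earlier in the thread:

```r
# The coefficients of raw powers are estimated with strong mutual
# correlation, so their marginal confidence limits cannot all be
# attained at once.
set.seed(123)
x <- seq(0, 4, by = 0.1)
y <- x^2 + rnorm(41)
fit <- lm(y ~ x + I(x^2) + I(x^3))

round(cov2cor(vcov(fit)), 2)  # off-diagonal entries close to +/- 1
```
]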
Looking at it another way: if I just exclude the quartic term from
the glm() fit, I get a dramatically narrower confidence band, even
though the fit to the data is not nearly as good (I'm keeping the
fourth power for good reasons). This again seems to me to indicate
that the confidence band with the quartic term included is
artificially, and incorrectly, wide. If a fit with reasonably
narrow confidence limits can be produced for the model with terms
only up to cubic (or, taking this reductio even further, for a
quadratic or even a linear model), why does adding the quartic
term, which improves the quality of the fit to the data, nevertheless
cause the confidence limits to balloon outwards? That seems
fundamentally illogical to me, but maybe my instincts are bad.
So what's going on here? Is this just a limitation of the way
confint() computes confidence intervals? Is there a better function
to use, that is compatible with binomial glm() fits? Or am I
misinterpreting the meaning of the confidence limits in some way?
Or what?
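[If the goal is a confidence band for the fitted curve itself, one option compatible with binomial glm() fits is predict() with se.fit=TRUE on the link scale, which uses the full covariance matrix of the coefficients, so the strongly correlated polynomial terms are handled correctly. A sketch with simulated data (the variable names and model are made up, not the real dispersal model):

```r
# Pointwise 95% confidence band for a binomial glm, computed on the
# link scale and back-transformed to the probability scale.
set.seed(1)
x <- runif(300)
y <- rbinom(300, 1, plogis(1 - 3 * x + 2 * x^2))
fit <- glm(y ~ poly(x, 2), family = binomial)

grid <- data.frame(x = seq(0, 1, length.out = 50))
pr   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
band <- plogis(cbind(fit   = pr$fit,
                     lower = pr$fit - 1.96 * pr$se.fit,
                     upper = pr$fit + 1.96 * pr$se.fit))
head(round(band, 3))
```
]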
Thanks for any advice. I've had trouble finding examples of
polynomial fits like this; people sometimes fit the square of the
explanatory variable, of course, but going up to fourth powers seems
to be quite uncommon, so I've had trouble finding out how others
have dealt with these sorts of issues that crop up only with higher
powers.
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT