On May 6, 2011, at 11:35 AM, Ben Haller wrote:
Hi all! I'm getting a model fit from glm() (a binary logistic
regression fit, but I don't think that's important) for a formula
that contains powers of the explanatory variable up to the fourth. So
the fit looks something like this (typing into mail; the actual fit
code is complicated because it involves step-down and so forth):
x_sq <- x * x
x_cb <- x * x * x
x_qt <- x * x * x * x
model <- glm(outcome ~ x + x_sq + x_cb + x_qt, binomial)
It might have been easier with:
model <- glm(outcome ~ poly(x, 4) , binomial)
Since the authors of confint() were probably expecting a poly()
formulation, the results might be more reliable with it. I'm just using
lm(), but I think the conclusions are more general:
> set.seed(123)
> x <- seq(0, 4, by=0.1)
> y <- seq(0, 4, by=0.1)^2 + rnorm(41)
> x2 <- x^2
> x3 <- x^3
> summary(lm(y ~ x + x2 + x3))
Call:
lm(formula = y ~ x + x2 + x3)
Residuals:
    Min      1Q  Median      3Q     Max
-1.8528 -0.6146 -0.1164  0.6211  1.8923

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.45737    0.52499   0.871   0.3893
x           -0.75989    1.15080  -0.660   0.5131
x2           1.30987    0.67330   1.945   0.0594 .
x3          -0.03559    0.11058  -0.322   0.7494
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
From that output I wouldn't have been able to understand why a highly
significant overall relationship could not be ascribed to any of the
terms. See what happens with poly(x, 3):
> summary(lm(y ~ poly(x, 3)))
Call:
lm(formula = y ~ poly(x, 3))
Residuals:
    Min      1Q  Median      3Q     Max
-1.8528 -0.6146 -0.1164  0.6211  1.8923

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   5.4271     0.1434  37.839  < 2e-16 ***
poly(x, 3)1  30.0235     0.9184  32.692  < 2e-16 ***
poly(x, 3)2   8.7823     0.9184   9.563 1.53e-11 ***
poly(x, 3)3  -0.2956     0.9184  -0.322    0.749
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9184 on 37 degrees of freedom
Multiple R-squared: 0.9691, Adjusted R-squared: 0.9666
F-statistic: 386.8 on 3 and 37 DF, p-value: < 2.2e-16
> confint(lm(y ~ poly(x, 3)))
                2.5 %    97.5 %
(Intercept)  5.136526  5.717749
poly(x, 3)1 28.162706 31.884351
poly(x, 3)2  6.921505 10.643149
poly(x, 3)3 -2.156431  1.565213
> confint(lm(y ~ x + x2 + x3))
                   2.5 %    97.5 %
(Intercept) -0.60636547 1.5211072
x           -3.09164639 1.5718576
x2          -0.05437802 2.6741120
x3          -0.25964705 0.1884609
Neither version exactly captures the simulation input, which would
have shown only an effect of the quadratic, but I didn't do any
centering. At least the poly version shows the lower order terms up to
quadratic as "significant", whereas it's difficult to extract much
meaningful out of the separately calculated term version.
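To make the collinearity concrete, here is a small sketch (same simulated x as above): the raw powers of x on [0, 4] are highly correlated with one another, while the columns produced by poly() are orthogonal by construction, which is why the individual t-tests behave so differently between the two parameterizations.

```r
# Raw powers of x over [0, 4] are strongly correlated with one another,
# while poly() builds an orthogonal basis for the same column space.
x <- seq(0, 4, by = 0.1)

X_raw  <- cbind(x1 = x, x2 = x^2, x3 = x^3)
X_poly <- poly(x, 3)

round(cor(X_raw), 3)    # off-diagonal correlations all above 0.9
round(cor(X_poly), 3)   # essentially the identity matrix
```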
--
David.
This works great, and I get a nice fit. My question regards the
confidence intervals on that fit. I am calling:
cis <- confint.default(model, level=0.95)
to get the confidence intervals for each coefficient. This
confint.default() call gives me a table like this (where "dispersal"
is the real name of what I'm calling "x" above):
                   2.5 %      97.5 %
(Intercept)    8.7214297   8.9842533
dispersal    -37.5621412 -35.6629981
dispersal_sq  66.8077669  74.4490123
dispersal_cb -74.5963861 -62.8347057
dispersal_qt  22.6590159  28.6506186
A side note: calling confint() instead of confint.default() is
much, much slower -- my model has many other terms I'm not
discussing here, and is fit to 300,000 data points -- but a check
indicates that the intervals returned for my model by confint() are
virtually identical, so this is not the source of my problem:
                   2.5 %      97.5 %
(Intercept)    8.7216693   8.9844958
dispersal    -37.5643922 -35.6652276
dispersal_sq  66.8121519  74.4534753
dispersal_cb -74.5995820 -62.8377766
dispersal_qt  22.6592724  28.6509494
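[For what it's worth, the two functions really are computing different things: confint.default() uses the Wald approximation (estimate +/- z * standard error), while confint() on a glm profiles the likelihood (via MASS), which is why it is so much slower. With 300,000 observations the two will normally agree closely. A small sketch on simulated data:

```r
# Wald intervals (confint.default) versus profile-likelihood intervals
# (confint) for a small simulated logistic regression.
set.seed(1)
x <- runif(500)
y <- rbinom(500, 1, plogis(-1 + 2 * x))
fit <- glm(y ~ x, family = binomial)

confint.default(fit)  # estimate +/- 1.96 * std. error
confint(fit)          # profile likelihood; close, but not identical
```
]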
These tables show the problem: the 95% confidence limits for the
quartic term are every bit as wide as the limits for the other
terms. Since the quartic term coefficient gets multiplied by the
fourth power of x, this means that the width of the confidence band
starts out nice and narrow (when x is close to zero, where the width
of the confidence band is pretty much just determined by the
confidence limits on the intercept) but balloons out to be
tremendously wide for larger values of x. The width of it looks
quite unreasonable to me, given that my fit is constrained by
300,000 data points.
Intuitively, I would have expected the confidence limits for the
quartic term to be much narrower than for, say, the linear term,
because the effect of variation in the quartic term is so much
larger. But the way confidence intervals are calculated in R, at
least, does not seem to follow my instincts here. In other words,
predicted values using the 2.5% value of the linear coefficient and
the 97.5% value of the linear coefficient are really not very
different, but predicted values using the 2.5% value of the
quadratic coefficient and the 97.5% value of the quadratic
coefficient are enormously, wildly divergent -- far beyond the range
of variability in the original data, in fact. I feel quite
confident, in a seat-of-the-pants sense, that the actual 95% limits
of that quartic coefficient are much, much narrower than what R is
giving me.
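[That seat-of-the-pants instinct can be checked directly: in the raw-power parameterization the coefficient estimates are strongly correlated with one another, so the marginal intervals cannot be combined endpoint-by-endpoint -- setting every coefficient at its 2.5% or 97.5% limit simultaneously lands far outside the joint 95% region. A sketch on the simulated data from earlier in the thread:

```r
# The coefficients of raw powers are estimated with strong mutual
# correlation, so their marginal confidence limits cannot all be
# attained at once.
set.seed(123)
x <- seq(0, 4, by = 0.1)
y <- x^2 + rnorm(41)
fit <- lm(y ~ x + I(x^2) + I(x^3))

round(cov2cor(vcov(fit)), 2)  # off-diagonal entries close to +/- 1
```
]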
Looking at it another way: if I just exclude the quartic term from
the glm() fit, I get a dramatically narrower confidence band, even
though the fit to the data is not nearly as good (I'm keeping the
fourth power for good reasons). This again seems to me to indicate
that the confidence band with the quartic term included is
artificially, and incorrectly, wide. If a fit with reasonably
narrow confidence limits can be produced for the model with terms
only up to cubic (or, taking this reductio even further, for a
quadratic or even a linear model), why does adding the quartic
term, which improves the quality of the fit to the data, nevertheless
cause the confidence limits to balloon outwards? That seems
fundamentally illogical to me, but maybe my instincts are bad.
So what's going on here? Is this just a limitation of the way
confint() computes confidence intervals? Is there a better function
to use, that is compatible with binomial glm() fits? Or am I
misinterpreting the meaning of the confidence limits in some way?
Or what?
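[If the goal is a confidence band for the fitted curve itself, one option compatible with binomial glm() fits is predict() with se.fit=TRUE on the link scale, which uses the full covariance matrix of the coefficients, so the strongly correlated polynomial terms are handled correctly. A sketch with simulated data (the variable names and model are made up, not the real dispersal model):

```r
# Pointwise 95% confidence band for a binomial glm, computed on the
# link scale and back-transformed to the probability scale.
set.seed(1)
x <- runif(300)
y <- rbinom(300, 1, plogis(1 - 3 * x + 2 * x^2))
fit <- glm(y ~ poly(x, 2), family = binomial)

grid <- data.frame(x = seq(0, 1, length.out = 50))
pr   <- predict(fit, newdata = grid, type = "link", se.fit = TRUE)
band <- plogis(cbind(fit   = pr$fit,
                     lower = pr$fit - 1.96 * pr$se.fit,
                     upper = pr$fit + 1.96 * pr$se.fit))
head(round(band, 3))
```
]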
Thanks for any advice. I've had trouble finding examples of
polynomial fits like this; people sometimes fit the square of the
explanatory variable, of course, but going up to fourth powers seems
to be quite uncommon, so I've had trouble finding out how others
have dealt with these sorts of issues that crop up only with higher
powers.
Ben Haller
McGill University
http://biology.mcgill.ca/grad/ben/
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
David Winsemius, MD
West Hartford, CT