On Nov 14, 2007 5:17 AM, Peter Dalgaard <[EMAIL PROTECTED]> wrote:
> Dylan Beaudette wrote:
> > On Tuesday 13 November 2007, Peter Dalgaard wrote:
> >> Prof Brian Ripley wrote:
> >>> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
> >>>> Hi,
> >>>>
> >>>> I have set up a simple logistic regression model with the glm()
> >>>> function, with the following formula:
> >>>>
> >>>>   y ~ a + b
> >>>>
> >>>> where 'a' is a continuous variable stratified by the levels of 'b'.
> >>>>
> >>>> Looking over the manual for model specification, it seems that
> >>>> coefficients for unordered factors are given 'against' the first
> >>>> level of that factor.
> >>>
> >>> Only for the default coding.
> >>>
> >>>> This makes for difficult interpretation when using factor 'b' as a
> >>>> stratifying model term.
> >>>
> >>> Really? You realize that you have not 'stratified' on 'b', which would
> >>> need the model to be a*b? What you have is a model with parallel
> >>> linear predictors for different levels of 'b', and if the coefficients
> >>> are not telling you what you want you should change the coding.
> >
> > Thanks for the comments, Peter. Note that any use/abuse of jargon on my
> > part is compounded by my limited understanding.
> >
> >> I have to differ slightly here. "Stratification", at least in the
> >> fields that I connect with, usually means that you combine information
> >> from several groups via an assumption that they have a common value of
> >> a parameter, which in the present case is essentially the same as
> >> assuming an additive model y ~ a + b.
> >
> > This was the definition of 'stratification' that I was using when
> > formulating this model.
> >
> >> I share your confusion as to why the parametrization of the effects of
> >> factor b should matter, though. Surely, the original poster has already
> >> noticed that the estimated effect of a is the same whether or not the
> >> intercept is included?
> >> The only difference I see is that running anova() or drop1() would
> >> not give interesting results for the effect of b in the no-intercept
> >> variation.
>
> > Yes, I have noticed that the estimated effect is the same. It looks
> > like I am having trouble interpreting the model formula and
> > coefficients. An example from MASS, using the default R contrasts:
> >
> > library(MASS)
> > data(whiteside)
> >
> > # simple additive model, relative to first level of 'Insul'
> > # 'parallel regressions'
> > summary(lm(Gas ~ Temp + Insul, data=whiteside))
> >
> >              Estimate Std. Error t value Pr(>|t|)
> > (Intercept)  6.55133    0.11809   55.48   <2e-16 ***
> > Temp        -0.33670    0.01776  -18.95   <2e-16 ***
> > InsulAfter  -1.56520    0.09705  -16.13   <2e-16 ***
> >
> > # same model without the intercept
> > summary(lm(Gas ~ Temp + Insul - 1, data=whiteside))
> >
> >              Estimate Std. Error t value Pr(>|t|)
> > Temp        -0.33670    0.01776  -18.95   <2e-16 ***
> > InsulBefore  6.55133    0.11809   55.48   <2e-16 ***
> > InsulAfter   4.98612    0.10268   48.56   <2e-16 ***
> >
> > In presenting the terms of this model, I would like to be able to
> > talk about the physical meaning of the coefficients. Minus the
> > intercept term, the second example presents the intercepts before and
> > after insulation. I now understand how to interpret the first example:
> >
> > (Intercept) + InsulAfter = the no-intercept InsulAfter:
> > 6.55133 + (-1.56520) = 4.98612
> >
> > ... the meaning of the coefficient for InsulAfter is relative to that
> > of the Intercept (treatment-style contrasts).
> >
> > With the above example the significance tests do not really change
> > when the intercept is removed. Within the context of my data, removing
> > the intercept has the following effect:
> >
> > # with intercept
> >                      Estimate Std. Error z value Pr(>|z|)
> > (Intercept)         7.296e+00  1.404e+00   5.196 2.03e-07 ***
> > bd_sum             -8.331e-04  1.533e-04  -5.433 5.55e-08 ***
> > clastic_volcanic   -1.396e-01  8.209e-01  -0.170  0.86495
> > coarse_sedimentary -2.720e+00  8.634e-01  -3.150  0.00163 **
> > felsic_intrusive   -3.862e-01  8.404e-01  -0.460  0.64583
> > fine_sedimentary   -1.010e+00  8.795e-01  -1.149  0.25069
> > rhyolite           -8.420e-01  8.531e-01  -0.987  0.32365
> > tuff                1.569e+01  1.008e+03   0.016  0.98758
> >
> > # without intercept
> >                      Estimate Std. Error z value Pr(>|z|)
> > bd_sum             -8.331e-04  1.533e-04  -5.433 5.55e-08 ***
> > andesite            7.296e+00  1.404e+00   5.196 2.03e-07 ***
> > clastic_volcanic    7.156e+00  1.276e+00   5.608 2.05e-08 ***
> > coarse_sedimentary  4.576e+00  1.158e+00   3.951 7.77e-05 ***
> > felsic_intrusive    6.910e+00  1.252e+00   5.520 3.39e-08 ***
> > fine_sedimentary    6.286e+00  1.279e+00   4.916 8.83e-07 ***
> > rhyolite            6.454e+00  1.289e+00   5.005 5.57e-07 ***
> > tuff                2.299e+01  1.008e+03   0.023    0.982
> >
> > ... the meaning of the coefficients now makes sense (I think!);
> > however, the significance of each level of the 'stratifying' factor
> > has changed considerably. How can I interpret that change?
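One way to act on Prof Ripley's "change the coding" suggestion is to pick a more interpretable base level with relevel(). A minimal sketch, using the whiteside data from MASS since the geology dataset in this thread is not reproducible (the Insul2 name is made up for illustration):

```r
library(MASS)  # provides the whiteside dataset
data(whiteside)

# Under the default treatment contrasts, each factor coefficient is a
# difference from the base (first) level. relevel() chooses that level.
whiteside$Insul2 <- relevel(whiteside$Insul, ref = "After")

# The 'Before' coefficient is now the Before-minus-After difference,
# i.e. the sign-flipped InsulAfter coefficient from the original coding
coef(lm(Gas ~ Temp + Insul2, data = whiteside))[["Insul2Before"]]  # ~ 1.56520
```

The same idea applies to a glm() fit: the contrast coefficients, and hence the marginal tests, are always relative to whichever level is the base.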
Peter:
> (This is not raw R output, is it? The name of the stratifying factor
> seems to be missing.)

Sorry about that, it was copied right out of R; however, I removed the
stratifying-factor prefix for brevity.

> If you add the Intercept to each term in the first analysis, as well
> as to zero for andesite, you get the numbers in the second analysis,
> or conversely subtract andesite from each term in the 2nd analysis and
> get those in the first.

This is now clear to me. It took writing out both forms in the last
email to fully realize that. (I am a soil scientist by trade... trying
to catch up in stats.)

> In the first display, you can immediately see that (the intercept of)
> the line for coarse_sedimentary is significantly lower than that for
> andesite. Since the regression coefficient on bd_sum is assumed the
> same for all strata, the difference in intercepts is actually the
> difference for any fixed value of bd_sum. The other factor levels are
> not significantly different from andesite.

OK -- this is what I was really struggling with here: interpreting the
marginal z-tests. I was interpreting them as tests of significance of
the levels of my stratifying factor as predictors of my response. I
will read up on the output from summary.glm() some more...

> In the second display, the significance tests are for whether the
> intercept for each per-stratum line can be zero. I.e. whether the
> value of the linear predictor is zero when bd_sum is zero, which in
> turn, since this is a logistic model, implies that the predicted
> probability is 0.5. This is probably not a relevant hypothesis at all,
> but all intercepts except the last one are significantly non-zero.
> "Tuff" appears to be anomalous; it has a large coefficient and a huge
> s.e. -- presumably the response is all 1's in this group?

Yikes. You are correct, this is not a hypothesis worth testing in this
case. The 'tuff' case is as you suggested: all response observations
were 1's.
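Peter's point that the two codings are the same model can be verified directly; a minimal sketch with the whiteside example from earlier in the thread:

```r
library(MASS)  # provides the whiteside dataset
data(whiteside)

# The same additive model, coded two ways
m1 <- lm(Gas ~ Temp + Insul, data = whiteside)      # default treatment contrasts
m2 <- lm(Gas ~ Temp + Insul - 1, data = whiteside)  # one intercept per level

# The fits are identical; only the meaning of the coefficients changes
all.equal(fitted(m1), fitted(m2))  # TRUE

# Adding the intercept to the treatment-contrast coefficient recovers
# the per-level intercept, exactly as described above
coef(m1)[["(Intercept)"]] + coef(m1)[["InsulAfter"]]  # 4.98612
coef(m2)[["InsulAfter"]]                              # 4.98612

# And in a logistic model, a linear predictor of zero means p = 0.5,
# which is what the no-intercept z-tests are testing against
plogis(0)  # 0.5
```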
Some follow-up: say we are using the same model:

logit(y) ~ temp + geology

The logit of the response is based on the influence of temperature and
the underlying geologic type. The influence of temperature should be
the same everywhere; however, the geologic type alters several other
physical parameters which cannot be directly modeled.

With the above advice I think that I can now interpret the meaning of
the marginal z-tests. However, I am not really interested in how a
certain geologic type's coefficient differs from that of the first
level of the factor. Would another model form help me with that?

# regression on each level of stratifying factor
# explicit removal of the intercept as suggested in MASS, ch. 6
logit(y) ~ geology/temp - 1

... the logit of the response is based on the influence of temperature,
which is different (both slope and intercept) across each level of
geology. The output from that linear model gives (I think) an intercept
and slope for each level of my stratifying factor; a separate
regression on temp for every level of geology (?). Interpretation of
the model parameters would potentially be more meaningful in my
context. Here is the output, this time without any deletions:

glm(formula = mollic ~ geo_class2/bd_sum - 1, family = binomial(),
    data = xy[-crit.points.idx, ], subset = parent_material != "ALL" &
        parent_material != "OBD" & dem_slope >= slope_cutoff)

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-1.9818343  -0.7562096   0.0003575   0.8130413   2.1977756

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
geo_class2andesite            1.146e+01  6.920e+00   1.656  0.09781 .
geo_class2clastic_volcanic    3.841e+00  2.233e+00   1.720  0.08539 .
geo_class2coarse_sedimentary  4.371e+00  2.390e+00   1.829  0.06743 .
geo_class2felsic_intrusive    5.384e+00  2.714e+00   1.984  0.04725 *
geo_class2fine_sedimentary    8.527e+00  4.013e+00   2.125  0.03361 *
geo_class2rhyolite            1.075e+01  4.071e+00   2.641  0.00828 **
geo_class2tuff                1.657e+01  9.116e+03   0.002  0.99855
geo_class2andesite:bd_sum           -1.356e-03  8.448e-04   -1.605  0.10858
geo_class2clastic_volcanic:bd_sum   -4.138e-04  2.831e-04   -1.462  0.14383
geo_class2coarse_sedimentary:bd_sum -8.041e-04  3.338e-04   -2.409  0.01601 *
geo_class2felsic_intrusive:bd_sum   -6.352e-04  3.503e-04   -1.813  0.06977 .
geo_class2fine_sedimentary:bd_sum   -1.122e-03  5.139e-04   -2.184  0.02897 *
geo_class2rhyolite:bd_sum           -1.361e-03  4.873e-04   -2.793  0.00522 **
geo_class2tuff:bd_sum                8.793e-15  1.200e+00 7.33e-15  1.00000
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 274.49  on 198  degrees of freedom
Residual deviance: 190.03  on 184  degrees of freedom
AIC: 218.03

... how would I interpret the marginal z-tests in this case? Would each
be a test of how well the intercept/slope pair predicts my response? A
quick comparison of the two models (again, as suggested in MASS) with
the anova() function:

Analysis of Deviance Table

Model 1: mollic ~ bd_sum + geo_class2
Model 2: mollic ~ geo_class2/bd_sum - 1
  Resid. Df Resid. Dev Df Deviance
1       190    194.801
2       184    190.033  6    4.769

... it does not appear that the second model fits any differently; is
this the correct interpretation? The fitted values are also very close,
but not the same.

Thanks again,

Dylan

> > Thanks,
> >
> > Dylan
>
> >> -p
> >>
> >>> Much of what I am trying to get across is that you have a lot of
> >>> choice as to how you specify a model to R. There has to be a
> >>> default, which is chosen because it is often a good choice. It does
> >>> rely on factors being coded well: the 'base level' (to quote
> >>> ?contr.treatment) needs to be interpretable.
> >>> And also bear in mind that the default choices of statistical
> >>> software in this area almost all differ (and R's differs from S,
> >>> GLIM, some ways to do this in SAS ...), so people's ideas of a
> >>> 'good choice' do differ.
> >>>
> >>>> Setting up the model, minus the intercept term, gives me what
> >>>> appear to be more meaningful coefficients. However, I am not sure
> >>>> if I am interpreting the results from a linear model without an
> >>>> intercept term correctly. Model predictions from both
> >>>> specifications (with and without an intercept term) are nearly
> >>>> identical (different by about 1e-16 in probability space).
> >>>>
> >>>> Are there any gotchas to look out for when removing the intercept
> >>>> term from such a model?
> >>>
> >>> It is just a different parametrization of the linear predictor.
> >>> Anything interpretable in terms of the predictions of the model
> >>> will be unchanged. That is the crux: the default coefficients of
> >>> 'b' will be log odds-ratios that are directly interpretable, and
> >>> those in the per-group coding will be log-odds for a zero value of
> >>> 'a'. Does a zero value of 'a' make sense?
> >>>
> >>>> Any guidance would be greatly appreciated.
> >>>>
> >>>> Cheers,
>
> --
>    O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
> ~~~~~~~~~~ - ([EMAIL PROTECTED])        FAX: (+45) 35327907

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.