Pardon me for intruding, but I recently had to explain something analogous to this to a distinguished physician who had hurt himself playing with survival models and was bleeding graphs and analyses all over the place...
Dylan Beaudette wrote:
> On Tuesday 13 November 2007, Prof Brian Ripley wrote:
>> On Tue, 13 Nov 2007, Dylan Beaudette wrote:
>>> Hi,
>>>
>>> I have set up a simple logistic regression model with the glm() function,
>>> with the following formula:
>>>
>>> y ~ a + b
>>>
>>> where:
>>> 'a' is a continuous variable stratified by the levels of 'b'
>>>
>>> Looking over the manual for model specification, it seems that
>>> coefficients for unordered factors are given 'against' the first level of
>>> that factor.
>
> Thanks for the quick reply.
>
>> Only for the default coding.
>
> Indeed, I should have added that to my initial message.
>
>>> This makes for difficult interpretation when using factor 'b' as a
>>> stratifying model term.
>
>> Really? You realize that you have not 'stratified' on 'b', which would
>> need the model to be a*b? What you have is a model with parallel linear
>> predictors for different levels of 'b', and if the coefficients are not
>> telling you what you want you should change the coding.
>
> I should have specified that interpretation was difficult, not because of
> the default behaviour, but rather because of my limitations and the nature
> of the data. Perhaps an example would help.
>
> y ~ a + b
>
> 'a' is a continuous predictor (i.e. temperature)
> observed on the levels of 'b' (geology)

Hmmm... are the levels of 'b' fixed (as in "reproducible", or "having well-known properties which could be expected to remain stable if the observation/experiment were repeated") or random (i.e. labels for whatever variation is thought to be attributable to local conditions)? In other words, are you interested in the effect of a, with b being a nuisance (a source of variability), or in the effects of b themselves? What glm() does assumes that you are treating b as a fixed effect. Treating it as a random effect would involve something like Douglas Bates' lme4. Your reference to geology leaves the interpretation open... more on this later.

> The form of the model (or at least what I was hoping for) would account for
> the variation in 'y' as predicted by 'a', within each level of 'b'. Am I
> specifying this model incorrectly?
>
>> Much of what I am trying to get across is that you have a lot of choice as
>> to how you specify a model to R. There has to be a default, which is
>> chosen because it is often a good choice. It does rely on factors being
>> coded well: the 'base level' (to quote ?contr.treatment) needs to be
>> interpretable. And also bear in mind that the default choices of
>> statistical software in this area almost all differ (and R's differs from
>> S, GLIM, some ways to do this in SAS ...), so people's ideas of a 'good
>> choice' do differ.
>
> Understood. I was not implying a level of 'goodness', rather hoping to gain
> some insight into a (possibly) mis-coded model specification.
>
>>> Setting up the model, minus the intercept term, gives me what appear to
>>> be more meaningful coefficients. However, I am not sure if I am
>>> interpreting the results from a linear model without an intercept term.
>>> Model predictions from both specifications (with and without an intercept
>>> term) are nearly identical (different by about 1E-16 in probability
>>> space).
>>>
>>> Are there any gotchas to look out for when removing the intercept term
>>> from such a model?
>
>> It is just a different parametrization of the linear predictor.
>> Anything interpretable in terms of the predictions of the model will be
>> unchanged.
>> That is the crux: the default coefficients of 'b' will be
>> log odds-ratios that are directly interpretable, and those in the
>> per-group coding will be log-odds for a zero value of 'a'. Does a zero
>> value of 'a' make sense?
>
> In the case of this experiment, a zero-level for 'a' does not make sense.

Then maybe re-scaling a into a new variable a', in a way that gives some useful physical meaning to "a = 0", might give you insights. For example, say you are interested in the variations of a given biological marker with body temperature. Whether you express the latter in °C, K or °F, your statistical inferences about the phenomenon will have the same validity. However, expressing it as (°C - 37) gives the coefficients a very simple, interesting and useful meaning: they describe the behaviour of your marker at a physiologically normal body temperature. Such a reference is genuinely useful. Bill Venables expressed this idea (and other very important ones) in a nice conference paper entitled (approximately) "Exegeses on Linear Models", whose reference I've lost (again!). In that paper he suggests that "centering" around the median is often useful (i.e. a' <- a - median(a)).

> Further thoughts welcomed.

Well... a bit on the parametrization: if you use R's default of contr.treatment, each coefficient of 'b' expresses the difference between the named level and the reference level (the one for which you find no coefficient in the summary(glm(...)) output). This is easy to interpret. If you use, for example, contr.helmert (which was, and may still be, the default in S-PLUS), your coefficients are something totally different: they are chosen so that the coefficients sum to 0 for each row of the recoded data. This choice is popular with some, because it allows direct calculation of the so-called "Type III sums of squares", so revered by the FDA and despised by Bill Venables and Brian Ripley. Other choices are possible and might be useful in different contexts.
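To make the re-scaling and contrast points concrete, here is a minimal, purely illustrative sketch; the data frame d, its variable names and the geology labels are invented for the example and do not come from the original post:

```r
## Made-up data: 'a' plays the role of temperature, 'b' of geology.
set.seed(1)
d <- data.frame(
  a = runif(120, 5, 30),
  b = factor(sample(c("granite", "schist", "limestone"), 120, replace = TRUE))
)
d$y <- rbinom(120, 1, plogis(-2 + 0.15 * d$a + 0.8 * (d$b == "schist")))

## Centre 'a' around its median, so that "a = 0" means "a typical temperature"
## and the intercept and the 'b' coefficients refer to something meaningful.
d$a_c <- d$a - median(d$a)

## Parallel linear predictors, default (treatment) contrasts:
fit1 <- glm(y ~ a_c + b, family = binomial, data = d)
summary(fit1)

## Same model with Helmert contrasts: the fitted probabilities are identical,
## only the meaning of the coefficients changes.
fit1h <- glm(y ~ a_c + b, family = binomial, data = d,
             contrasts = list(b = "contr.helmert"))
all.equal(fitted(fit1), fitted(fit1h))
```

In this made-up example, coef(fit1)["bschist"] is the log odds-ratio of "schist" against the reference level, and the intercept is the log-odds for the reference level at the median temperature, which is exactly the kind of interpretable quantity the centering buys you.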
Now for "stratification". You wrote your model as "y ~ a + b". This can be translated as: "the (log odds-ratio of the) effect fluctuates around a central value which is the sum of a grand mean (the intercept), a factor specific to the particular geology, and a quantity proportional to temperature, this last proportionality factor being the same for *all* geologies". In other words, you get tools allowing you to predict y in various circumstances:
- an intercept (a "grand mean"), representing the *expectation* for y when a = 0 and b is neglected;
- a set of corrections for each level of b (minus the reference, as explained before), allowing you to form a "better" expectation for a = 0;
- a coefficient (the common slope) allowing you to refine your expectation once the temperature a is known.

If you wanted to express the idea that this proportionality factor might vary between geologies, you would write "y ~ a*b", which (as MASS will explain to you in detail) is equivalent to writing "y ~ a + b + a:b", and which can be translated (in poor English) as: "the (log odds-ratio of the) effect fluctuates around a central value which is the sum of a grand mean (the intercept), a factor specific to the particular geology, a quantity proportional to temperature (the same for *all* geologies), AND a correction to this proportionality factor, specific to each geology, times temperature". This will give you:
- a grand mean, as before, for a = 0 and b neglected;
- a set of differences between each level of b and the reference level (for a = 0, of course);
- a "gross slope" of the line passing "at best" through the cloud representing your observations in an (a, y) graph, allowing you to refine your prediction when a is known but b is ignored;
- and a correction factor to this slope for each level of b.

This is maybe not the easiest way to express the relationship. You might be more interested in each particular slope than in a "grand slope". This can be written as "y ~ b + (a %in% b)", which is equivalent to "y ~ b + a:b" and gives you directly a grand mean, a correction for each group and each group's slope. You could also have no interest in the "grand mean" but only in each particular mean: "y ~ a*b - 1", where the "-1" term suppresses the intercept (the "grand mean"), which is therefore absorbed into the coefficient of each level of b, thus giving directly the group means for a = 0. You might even be interested in getting just a mean and a slope for each group: "y ~ b + (a %in% b) - 1", which combines the two modifications made before. But note that if you need to test for an effect of a or an effect of b, the models in which you suppress the "grand mean" or the "grand slope" will give you difficulties: the hypotheses you will be testing will differ from what is usually tested. This is well explained in MASS 4 (with more details in the "Linear Statistical Models" chapter). Please also note that I have abused the meaning of "prediction": being able to predict an expectation is not sufficient, and predict.glm() and friends are there for a good reason...
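For what it is worth, here is how those parametrizations look side by side on the same invented data frame d from the sketch above; they are re-expressions of the same model, so the fits are identical and only the reported coefficients differ:

```r
## Four parametrizations of the "different slope per geology" model, fitted
## to the made-up data frame d above (a_c is the median-centred temperature).
m_int   <- glm(y ~ a_c * b,            family = binomial, data = d)  # grand slope + per-geology corrections
m_nest  <- glm(y ~ b + a_c %in% b,     family = binomial, data = d)  # one slope per geology
m_noint <- glm(y ~ a_c * b - 1,        family = binomial, data = d)  # one mean (at a_c = 0) per geology
m_both  <- glm(y ~ b + a_c %in% b - 1, family = binomial, data = d)  # per-geology mean and slope

## Identical fits (same deviance), different coefficient sets:
sapply(list(m_int, m_nest, m_noint, m_both), deviance)
lapply(list(m_int, m_nest, m_noint, m_both), coef)
```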
*******

A word about "random effects": up to now I have supposed that you are indeed interested in the effect of b ("geology") on y because b is some reproducible feature of your terrain (e.g. a particular level of b might be "Hercynian fold"), and that you expect to extract information that will remain valid when going to another site with the same characteristics (e.g. learning something in the Alps and using this knowledge in the Pyrenees).

Now it could be something entirely different. To make myself clear, I will return to biology. Suppose that I am interested in the relation between body temperature and a biological marker. I might want to know whether this relationship differs when my patient is male or female: this is a characteristic I can easily observe if I repeat the experiment with another subject (or I should change professions pronto...), and I can expect that a "sex effect" observed on a random sample of a given population will be observed again when extracting another sample from the same population.

I might also observe that the relation between my marker and body temperature varies with "familial characteristics", i.e.: 1) this relation varies from person to person, but 2) this variation is smaller between people coming from the same family than between unrelated people. The point is that this characteristic (the "family") cannot be *observed* (or *measured*) directly; it is but a *label*. Therefore, if I observe this effect in a set of families chosen at random from a given population, I cannot expect to reuse this knowledge when studying another set of families extracted from the same population. In other words, I just know that the variability ("distance", in other words) between unrelated people is larger than the variability between related people.

It is rather tempting to decompose this variability into two parts, "between families" and "within families", thus "explaining away" part of the variability and getting better estimates of the relationship of interest (i.e. the temperature-marker relationship). This is similar, but by no means identical, to the separation between the effect of temperature and the effect of sex one would do (ANCOVA) in the previous example: the "family" effect is random, you cannot observe it directly, and you have to estimate the variability it adds to the main effect. That is what "mixed models" aim to do. The way to separate fixed effects (in our case the temperature-marker relationship) from random effects (the "between families" variation) and from residual variability (exemplified by the "within families" variability) is somewhat intricate. I recommend reading Pinheiro and Bates (2000), referenced on the "Books" page of the main R site, and the introduction to the lme4 package. But be aware that this is a bit more intricate than simple (generalized) linear models...

In your case, your "geology" factor may or may not be such a label. You have to look at the matter at hand to make a choice.

Hope this helps,

Emmanuel Charpentier
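P.S. Purely for illustration, here is a sketch of what treating "geology" as a random effect could look like, using lme4's glmer() on the invented data frame d from the sketches above (none of this comes from the original analysis):

```r
## Illustrative only: d is the made-up data frame from the earlier sketches,
## with binary outcome y, centred temperature a_c and grouping factor b.
library(lme4)

## Fixed effect of temperature, random intercept for each level of b:
m1 <- glmer(y ~ a_c + (1 | b), family = binomial, data = d)

## Random intercept *and* random slope of a_c within b, the mixed-model
## analogue of letting the temperature slope vary between geologies:
m2 <- glmer(y ~ a_c + (1 + a_c | b), family = binomial, data = d)

summary(m1)

## Caveat: with only a handful of geology classes, the "between b" variance
## is poorly estimated; random effects make most sense when b has many
## levels that can be seen as a sample from a larger population.
```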