Hi Dennis, Thanks very much for the details. All those explanations about non-estimable mu_{12} when it has no data make sense to me!
Regarding my specific example data where mu_{12} should NOT be estimable in a linear model with interaction because it has no data, yet the linear model I created by using lm() in R still CAN estimate the mean mu_{12}, while on the other hand, mu_{72} is instead NOT estimable from lm() even this category does have data. Does this contradiction to the theory imply that the linear model by lm() in R on my specific example data is NOT reliable/trustable and should not be used? Thanks John ________________________________ From: Dennis Murphy <djmu...@gmail.com> Cc: "r-help@r-project.org" <r-help@r-project.org> Sent: Tuesday, November 8, 2011 10:22 AM Subject: Re: [R] why NA coefficients The cell mean mu_{12} is non-estimable because it has no data in the cell. How can you estimate something that's not there (at least without imputation :)? Every parametric function that involves mu_{12} will also be non-estimable - in particular, the interaction term and the population marginal means . That's why you get the NA estimates and the warning. All this follows from the linear model theory described in, for example, Milliken and Johnson (1992), Analysis of Messy Data, vol. 1, ch. 13. Here's an example from Milliken and Johnson (1992) to illustrate: B1 B2 B3 T1 2, 6 8, 6 T2 3 14 12, 9 T3 6 9 Assume a cell means model E(Y_{ijk}) = \mu_{ij}, where the cell means are estimated by the cell averages. From M & J (p. 173, whence this example is taken): "Whenever treatment combinations are missing, certain hypotheses cannot be tested without making some additional assumptions about the parameters in the model. Hypotheses involving parameters corresponding to the missing cells generally cannot be tested. For example, for the data [above] it is not possible to estimate any linear combinations (or to test any hypotheses) that involve parameters \mu_{12} and \mu_{33} unless one is willing to make some assumptions about them." They continue: "One common assumption is that there is no interactions between the levels of T and the levels of B. In our opinion, this assumption should not be made without some supporting experimental evidence." In other words, removing the interaction term makes the non-estimability problem disappear, but it's a copout unless there is some tangible scientific justification for an additive rather than an interaction model. For the above data, M & J note that it is not possible to estimate all of the expected marginal means - in particular, one cannot estimate the population marginal means $\bar{\mu}_{1.}$, $\bar{\mu}_{3.}$, $\bar{\mu}_{.2}$ or $\bar{\mu}_{.3}$. OTOH, $\bar{\mu}_{2.}$ and $\bar{\mu}_{.1}$ since these functions of the parameters involve terms associated with the means of the missing cells. Moreover, any hypotheses involving parametric functions that contain non-estimable cell means are not testable. In this example, the test of equal row population marginal means is not testable because $\bar{\mu}_{1.}$ and $\bar{\mu}_{3.}$ are not estimable. [Aside: if the term parametric function is not familiar, in this context it refers to linear combinations of model parameters. In the M & J example, $\bar{\mu}_{1.} = \mu_{11} + \mu_{12} + \mu_{13}$ is a parametric function.] Hopefully this sheds some light on the situation. Dennis > Hi Dennis, > The cell mean mu_12 from the model involves the intercept and factor 2: > Coefficients: > (Intercept) factor(treat)2 > factor(treat)3 > 0.429244 > 0.499982 0.352971 > factor(treat)4 factor(treat)5 > factor(treat)6 > -0.204752 > 0.142042 0.044155 > factor(treat)7 factor(group)2 > factor(treat)2:factor(group)2 > -0.007775 > -0.337907 -0.208734 > factor(treat)3:factor(group)2 factor(treat)4:factor(group)2 > factor(treat)5:factor(group)2 > -0.195138 > 0.800029 0.227514 > factor(treat)6:factor(group)2 factor(treat)7:factor(group)2 > 0.331548 NA > So mu_12 = 0.429244-0.337907 = 0.091337. This can be verified by: >> predict(fit,data.frame(list(treat=1,group=2))) > 1 > 0.09133691 > Warning message: > In predict.lm(fit, data.frame(list(treat = 1, group = 2))) : > prediction from a rank-deficient fit may be misleading > > But as you can see, it gave a warning about rank-deficient fit... why this > is a rank-deficient fit? > Because "treat 1_group 2" has no cases, so why it is still estimable while > on the contrary, "treat 7_group 2" which has 2 cases is not? > Thanks > John > > > > > ________________________________ > From: Dennis Murphy <djmu...@gmail.com> > Sent: Monday, November 7, 2011 9:29 PM > Subject: Re: [R] why NA coefficients > > Hi John: > > What is the estimate of the cell mean \mu_{12}? Which model effects > involve that cell mean? With this data arrangement, the expected > population marginal means of treatment 1 and group 2 are not estimable > either, unless you're willing to assume a no-interaction model. > > Chapters 13 and 14 of Milliken and Johnson's Analysis of Messy Data > (vol. 1) cover this topic in some detail, but it assumes you're > familiar with the matrix form of a linear statistical model. Both > chapters cover the two-way model with interaction - Ch.13 from the > cell means model approach and Ch. 14 from the model effects approach. > Because this was written in the mid 80s and republished in the early > 90s, all the code used is in SAS. > > HTH, > Dennis > >> Thanks David. The only category that has no cases is "treat 1-group 2": >> >>> with(test,table(treat,group)) >> group >> treat 1 2 >> 1 8 0 >> 2 1 5 >> 3 5 5 >> 4 7 3 >> 5 7 4 >> 6 3 3 >> 7 8 2 >> >> But why the coefficient for "treat 7-group 2" is not estimable? >> >> Thanks >> >> John >> >> >> >> >> ________________________________ >> From: David Winsemius <dwinsem...@comcast.net> >> >> Cc: "r-help@r-project.org" <r-help@r-project.org> >> Sent: Monday, November 7, 2011 5:13 PM >> Subject: Re: [R] why NA coefficients >> >> >> On Nov 7, 2011, at 7:33 PM, array chip wrote: >> >>> Hi, I am trying to run ANOVA with an interaction term on 2 factors (treat >>> has 7 levels, group has 2 levels). I found the coefficient for the last >>> interaction term is always 0, see attached dataset and the code below: >>> >>>> test<-read.table("test.txt",sep='\t',header=T,row.names=NULL) >>>> lm(y~factor(treat)*factor(group),test) >>> >>> Call: >>> lm(formula = y ~ factor(treat) * factor(group), data = test) >>> >>> Coefficients: >>> (Intercept) factor(treat)2 >>> factor(treat)3 >>> 0.429244 0.499982 >>> 0.352971 >>> factor(treat)4 factor(treat)5 >>> factor(treat)6 >>> -0.204752 0.142042 >>> 0.044155 >>> factor(treat)7 factor(group)2 >>> factor(treat)2:factor(group)2 >>> -0.007775 -0.337907 >>> -0.208734 >>> factor(treat)3:factor(group)2 factor(treat)4:factor(group)2 >>> factor(treat)5:factor(group)2 >>> -0.195138 0.800029 >>> 0.227514 >>> factor(treat)6:factor(group)2 factor(treat)7:factor(group)2 >>> 0.331548 NA >>> >>> >>> I guess this is due to model matrix being singular or collinearity among >>> the matrix columns? But I can't figure out how the matrix is singular in >>> this case? Can someone show me why this is the case? >> >> Because you have no cases in one of the crossed categories. >> >> --David Winsemius, MD >> West Hartford, CT >> [[alternative HTML version deleted]] >> >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. >> >> > > > [[alternative HTML version deleted]]
______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.