Hi all, Thanks Dennis, do greatly appreciate the detailed reply. That does answer a significant part of what I was getting at.
Unfortunately, I did not ask my question very well. Apologize. Suppose I run > dat <- data.frame(a = factor(rep(letters[1:3], each = 4)), + b = factor(rep(rep(1:2, each = 2), 3)), + c = rnorm(12), d=rnorm(12)) > dat$e<-0.999*dat$d I am (intentionally) making e "almost" equal to d to trigger the rank condition. Subsequently, I run > summary(lm(c~a+b+d+e, data=dat)) Call: lm(formula = c ~ a + b + d + e, data = dat) Residuals: Min 1Q Median 3Q Max -1.0726 -0.5578 0.1132 0.4296 1.3397 Coefficients: (1 not defined because of singularities) Estimate Std. Error t value Pr(>|t|) (Intercept) 0.2564 0.5288 0.485 0.643 ab 1.0253 0.7234 1.417 0.199 ac -0.5434 0.6869 -0.791 0.455 b2 -0.5521 0.5513 -1.002 0.350 d -0.4098 0.2851 -1.438 0.194 e NA NA NA NA Residual standard error: 0.91 on 7 degrees of freedom Multiple R-squared: 0.5177, Adjusted R-squared: 0.2421 F-statistic: 1.878 on 4 and 7 DF, p-value: 0.2191 R appropriately drops e from the design matrix to make X'X nonsingular. I am wondering how R decides (what is the convention) whether to drop d, or to drop e. Are terms added from left to right of the formula, and hence d retained and e dropped? More generally, if presented with a non-singular design matrix with column names (x1, x2, x3, x4, x5), what algorithm is used by R to figure out which columns to drop. Eg: suppose the design matrix with five columns (x1, x2, x3, x4, x5) has rank 3, and x3 = a1x1 + a2x2, and x5 = a4x4. Which 2 columns would R try to drop? I am mostly asking these questions because I am in the process of working on a package. I want to make sure that my treatment of singular design matrices is, as far as possible, compatible with lm. Thanks again! Best, Anirban On Wed, Jul 21, 2010 at 1:41 PM, Dennis Murphy <djmu...@gmail.com> wrote: > Hi: > > (1) lm() drops columns in a rank-deficient model matrix X to make X'X > nonsingular - this is called a full-rank reparameterization of the linear > model. > > (2) How many columns of X are dropped depends on its rank, which in turn > depends on the number of constraints in the model matrix. This is related to > the degrees of freedom associated with each term in the corresponding linear > model.. The default contrasts are >> options()$contrasts > unordered ordered > "contr.treatment" "contr.poly" > > Other choices include contr.helmert() and contr.sum(). See the help page > ?contrasts for further details. See also section 6.2 of Venables and > Ripley's _Modern Applied Statistics with S, 4th ed._ for further information > on the connection between the columns of the model matrix in ANOVA and the > defined sets of contrasts. > > Under the default contrasts, the first column is dropped for the main effect > terms. Here's a simple example of a balanced 2-way ANOVA with interaction: > > d <- data.frame(a = factor(rep(letters[1:3], each = 4)), > b = factor(rep(rep(1:2, each = 2), 3)), > c = rnorm(12)) >> d > a b c > 1 a 1 -0.77367688 > 2 a 1 -0.79069791 > 3 a 2 0.69257133 > 4 a 2 2.46788204 > 5 b 1 0.38892289 > 6 b 1 -0.03521033 > 7 b 2 -0.01071611 > 8 b 2 -0.74209425 > 9 c 1 1.36974281 > 10 c 1 -1.22775441 > 11 c 2 0.29621976 > 12 c 2 0.28208192 > m <- aov(c ~ a * b, data = d) > model.matrix(m) > (Intercept) ab ac b2 ab:b2 ac:b2 > 1 1 0 0 0 0 0 > 2 1 0 0 0 0 0 > 3 1 0 0 1 0 0 > 4 1 0 0 1 0 0 > 5 1 1 0 0 0 0 > 6 1 1 0 0 0 0 > 7 1 1 0 1 1 0 > 8 1 1 0 1 1 0 > 9 1 0 1 0 0 0 > 10 1 0 1 0 0 0 > 11 1 0 1 1 0 1 > 12 1 0 1 1 0 1 > attr(,"assign") > [1] 0 1 1 2 3 3 > attr(,"contrasts") > attr(,"contrasts")$a > [1] "contr.treatment" > > attr(,"contrasts")$b > [1] "contr.treatment" > > Notice that the first column of each main effect is dropped, and that the > interaction columns are the products of the retained a columns with the > retained b columns. The attr() components verify that the contrast method > used for this matrix is the default contr.treatment. > >> anova(m) > Analysis of Variance Table > > Response: c > Df Sum Sq Mean Sq F value Pr(>F) > a 2 0.5001 0.25003 0.2827 0.7633 > b 1 1.3700 1.36999 1.5489 0.2597 > a:b 2 4.5647 2.28235 2.5804 0.1554 > Residuals 6 5.3070 0.88450 > > Examination of the degrees of freedom tells us that there are two > independent contrasts for a, one independent contrast for b and two > independent contrasts for the interaction of a and b, which are shown in > model.matrix(m) above. > > If you want to reset the baselline factor level, see ?relevel. Also look > into the contrast package on CRAN. > > HTH, > Dennis > > On Wed, Jul 21, 2010 at 8:40 AM, Anirban Mukherjee <am...@cornell.edu> > wrote: >> >> Hi all, >> >> If presented with a singular design matrix, lm drops columns to make the >> design matrix non-singular. What algorithm is used to select which (and >> how >> many) column(s) to drop? Particularly, given a factor, how does lm choose >> levels of the factor to discard? >> >> Thanks for the help. >> >> Best, >> Anirban >> >> [[alternative HTML version deleted]] >> >> ______________________________________________ >> R-help@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. > > ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.