Re: [R] parameterization of glm nested design

Charles C. Berry Fri, 02 Jul 2010 08:52:02 -0700


Comments inline, below.


On Thu, 1 Jul 2010, Huso, Manuela wrote:

Chuck,

Thank you for welcoming me to the list and thank you for taking the time to 
address my question.  Pointing me to the pivot and qr components of my model 
object was very useful. But I still don't understand how R determines its 
pivot, i.e. the ordering of the coefficients.  My responses are below:

-----Original Message-----
From: Charles C. Berry [mailto:cbe...@tajo.ucsd.edu]
Sent: Thursday, July 01, 2010 9:30 AM
To: Huso, Manuela
Cc: 'r-help@r-project.org'
Subject: Re: [R] parameterization of glm nested design

On Wed, 30 Jun 2010, Huso, Manuela wrote:

Dear R community,

I am new to R, a reforming SAS user :)


Welcome aboard!
[MH: ] Thank you :-)

I am running R 2.10.1 on a Windows XP machine.  I would like to write
linear functions of my coefficient parameter estimates from a glm, but
am having a difficult time understanding the parameterization R uses.
In the toy example below I am running a glm on binomial data, with
clones and lines within clones as fixed effects, each with 6 replicates.
I cannot figure out the algorithm R uses for determining which
combination of levels is used as the reference.  Crawley, in the chapter
on Statistical Modeling in the R Book states "The writers of R agree
that treatment contrasts represent the best solution.  This method does
away with parameter a, the overall mean. The mean of the factor level
that comes first in the alphabet (control, in our example) is promoted
to pole position, and the other effects are shown as differences
(contrasts) between this mean and the other four factor level means."

This pattern seems to hold for full factorials, but it doesn't appear to
work with this example in which Line is nested within Clone.  In this
example, R appears to use the last line in alphabet order of the first
clone (Clone 1) as the intercept.


Not in my locale:
[MH: ] These are the same in mine...

sort(levels(tmp$Cl))

[1] "1" "2" "3" "4"


sort(levels(tmp$L))

[1] "1"    "104"  "116"  "14"   "84-1" "9"    "91"   "96"


cat(names(coef(tmp.glm)),fill=50)

(Intercept) Cl2 Cl3 Cl4 Cl1:L104 Cl2:L104
Cl3:L104 Cl4:L104 Cl1:L116 Cl2:L116 Cl3:L116
Cl4:L116 Cl1:L14 Cl2:L14 Cl3:L14 Cl4:L14
Cl1:L84-1 Cl2:L84-1 Cl3:L84-1 Cl4:L84-1 Cl1:L9
Cl2:L9 Cl3:L9 Cl4:L9 Cl1:L91 Cl2:L91 Cl3:L91
Cl4:L91 Cl1:L96 Cl2:L96 Cl3:L96 Cl4:L96


# list any terms ending in Cl1 or L1
grep("Cl1$|L1$",names(coef(tmp.glm)))

integer(0)

# so those terms were dropped!!

Cl1 is dropped as are Cl[1234]*:L1. [MH: ] Yes, however, this is anested design, so Line 1 doesn't even occur in Clones 1,2 or 3, so itcan't *really* be dropped. When I list the coefficient estimates forall of these 32 combinations of 4 Clones and 8 Lines, several arenecessarily NA because of the nesting, i.e. those combinations do notexist Others are NA because of how R has chosen to parameterize themodel. So, while in principle, line 1 was dropped from all clones, itwas relevant only for Clone 4, because line 1 occurs only in Clone 4.

Well, I didn't bring this up before, but I have some doubts about thisbeing a 'nested' design.

The replicates in a nested design are thought to be exchangeable, but yourchoice of labels like 'Line 84-1' suggest otherwise. Also, in a trulynested design the analyst should not care which replicate in a collectionof nested replicates is used a s the reference level as the coefficientsusually are not interpreted.


Rather it looks like you have an incompletely crossed design.

coef(tmp.glm)
(Intercept)         Cl2         Cl3         Cl4
-0.8472979  -0.4744580   0.3007542   0.7455152
  Cl1:L104    Cl2:L104    Cl3:L104    Cl4:L104
 0.5108256          NA          NA          NA
  Cl1:L116    Cl2:L116    Cl3:L116    Cl4:L116
        NA          NA          NA  -1.0878014
   Cl1:L14     Cl2:L14     Cl3:L14     Cl4:L14
        NA   1.1882244  -0.3814431          NA
 Cl1:L84-1   Cl2:L84-1   Cl3:L84-1   Cl4:L84-1
        NA          NA          NA  -0.5172565
    Cl1:L9      Cl2:L9      Cl3:L9      Cl4:L9
        NA   0.8463321          NA          NA
   Cl1:L91     Cl2:L91     Cl3:L91     Cl4:L91
        NA          NA  -0.2225894          NA
   Cl1:L96     Cl2:L96     Cl3:L96     Cl4:L96
        NA          NA          NA          NA

In order to achieve full rank in this nested design, other lines within Clones 
had to be
dropped as well, i.e. R had to choose a particular parameterization.  The 
possibilities of what to drop are limited to lines that exist within each 
clone.  Here are the possible combinations of clones and lines:

with(tmp,table(L,Cl))
     Cl
L      1 2 3 4
 1    0 0 0 6
 104  6 0 0 0
 116  0 0 6 0
 14   0 6 0 6
 84-1 0 0 6 0
 9    0 0 0 6
 91   6 6 0 0
 96   0 6 6 0
(The 6 represents the 6 replicates of each combination.)

In the model, a coefficient for Cl1:L104 is estimated, but not for any other 
line in Cl1.  Because Cl1 has only one other line within it (L91), this line 
within Cl1 was used as the intercept, with all other coefficients representing 
differences with respect to this reference.  Using this parameterization I 
understand that the
estimated log(odds) of Cl1:L91 = Intercept = -0.8472979
estimated log(odds) of Cl1:L104 = Intercept + Cl1:L104 = -0.8472979 + 0.5108256

I checked this to be sure, and indeed the log odds of Cl1:L91 is the intercept.
Cl1<-subset(tmp, Cl=="1")
Cl1.c <- log(mean(Cl1$rate[Cl1$L=="91"])/(1-mean(Cl1$rate[Cl1$L=="91"])))
Cl1.c
[1] -0.8472979

What this implies is that R has used the *last* level of line within Cl1as the 'reference' or what Crawley calls the 'pole position'. In factit has used the last line within every clone as the 'reference' EXCEPTfor Cl4, the clone in which the first line (L1) appears. (I've alsoswitched L1 to appear in Cl3 rather than Cl4 and R still uses the *last*line within each clone, except for the clone in which the first line(L1) appears.)

You need to separate the construction of the model.matrix() from what Rdoes with it algebraically.

The model.matrix() contains terms that can only be estimated up to alinear constraint. qr() handles this automagically. If you want to betterunderstand how, you need to go through the sources (see ref below).

If you want to better control how R handles designs that are of less thanfull rank, you have contrasts()<- (please read ?contr.treatment) to modifywhich category is treated as the base.

And you have the option of constructing factors in your data.frame (or inthe formula itself) that have reduced levels() and using them in a formulato more precisely control construction of the model.matrix.

Thereafter, the reference levels for
Clones are the last lines of Clones 2 and 3, but the first line (Line 1)
of Clone 4.  The first line of Clone 4, is also the first line over all
lines.  Can someone please explain what process R uses to determine the
parameterization of this model?


Huh?
[MH: ] See above...

AFAICS, it dropped the first level in sort(levels(...)) of Cl and the
first level of L in the Cl:L terms as noted above.
[MH: ] Your grep produces a 0 because L1 doesn't occur in any other clones than 
Cl1.

And if you want something different, "contrasts()<-" is your friend.
[MH: ] Yes, I can set contrasts.  In this case I am using R's default contrast 
for an unordered factor, i.e. treatment contrasts.  But treatment contrasts are 
not producing the results I would expect, given the help files I have been able 
to find. I am interested in understanding R's logic.
See
        ?contrasts

and the See Also's there.


Perhaps you are asking why the pivoting dropped some terms and not
others??
[MH: ] Yes, I think I am... thank you for pointing me to pivots and qr.  I 
guess I was asking how R determines how to order the pivot.

If so, you'll need to look at tmp.glm$qr$pivot or just
qr(model.matrix(~Cl/L,tmp))$pivot and dig through the qr() source codes to
follow the algebra.
[MH: ] I would if I knew how... when I type qr without parentheses, I see:
qr
function (x, ...)
UseMethod("qr")
<environment: namespace:base>
Then I'm stuck.



Read

        An Intro to R Section 10.9 Classes, generic functions and object
        orientation

and

        Uwe Ligges. R Help Desk: Accessing the sources. R News, 6(4):43-45,
        October 2006


You do know that the design is not of full rank, right?

[MH: ] Well, yes. It is a nested design. In fact any design matrixwith an indicator variable for each level of a factor variable that*also* includes the intercept will not be of full rank, hence thedifferent choices in parameterization (STAT651).

Right. But there is a convention in an R formula like 'y ~ my.factor' thatcontrasts are used, and under that convention the design constructed bymodel.matrix() need not be of reduced rank.

I'm confused by what the literature says R does in default treatmentcontrasts (chooses to drop the factor level that is lowestalphabetically) vs what it appears R does in a more complicated casethan a full factorial.


Answered far above, I think in the model.matrix() vs qr() business.

You might find it instructive to use qr() on your model.matrix() andon subsets of columns of it.

But if you want control over the ultimate model.matrix used by glm, youneed to construct factors with reduced levels() and a suitable formula (oras a last resort build up the columns of the design you want and placethem in the data.frame, but that is truly a last resort).



Chuck.


HTH,
[MH: ] Yes, it does!  Many thanks.

Chuck

p.s. Thanks for following the posting guide's dictum to include a
self-contained example.

p.p.s. Here is my locale:
[MH: ] Mine is the same...

sessionInfo()$locale

[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United
States.1252;LC_MONETARY=English_United
States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Of course, I know what the parameterization is in this model and can
write the functions I need.  However, I am trying to understand the
algorithm R uses to determine the parameterization.  This is only a toy
example of a much larger study and I would like to know when I should
use the last level as the reference and when I should use the first.

Many thanks,
Manuela

tmp <- data.frame(Cl=rep(1:4,c(12,18,18,18)),
                  L =rep(c(91,104,14,91,96,"84-1",96,116,1,9,14),each=6),
                  N =rep(10,(12+18*3)),
                  D =c(1,6,8,1,1,1,2,6,10,3,3,1,2,1,1,0,4,4,6,5,3,5,1,3,
                       1,6,8,5,5,3,5,2,1,0,4,5,1,0,2,3,6,7,4,0,2,5,3,8,1,
                       4,7,0,6,3,7,2,3,6,1,9,7,2,1,3,0,1) )
tmp$N[c(13,15,59)] <- c(8,9,9)  # not always 10 trials
tmp$Cl <- factor(tmp$Cl)            # Make sure clone and line within clone are 
factors
tmp$L  <- factor(tmp$L)

model <- formula(cbind(D,(N-D))~ Cl/L)
tmp.glm<-glm(data=tmp,formula=model, family=quasibinomial(link="logit"))

with(tmp,table(L,Cl))
summary.glm(tmp.glm)$coefficients

unique(tmp[order(tmp$L),1:2])
# The coeff estimates R gives are in this order,
#   but R is using the rows c(1,8,10,11) as reference
# Why?

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Charles C. Berry                            (858) 534-2098
                                            Dept of Family/Preventive Medicine
E mailto:cbe...@tajo.ucsd.edu               UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] parameterization of glm nested design

Reply via email to