You are correct that you need to use ~-1+A:B. I use that all the time and just spaced it out when I was writing
the response.  Using ~A:B will produce one too many columns.

Didn't follow your other question. You can always look at the result of model.matrix to see if it is correct or dig into the
internal code.  Personally, I've never found it to have a problem.

--jeff

On 3/5/2010 10:16 AM, blue sky wrote:
On Fri, Mar 5, 2010 at 11:41 AM, Jeff Laake<jeff.la...@noaa.gov>  wrote:
On 3/5/2010 9:19 AM, Frank E Harrell Jr wrote:
You neglected to state your name and affiliation, and your question
demonstrates an allergy to R documentation.

Frank
I agree with Frank but will try to answer some of your questions as I
understand it.

First, model.matrix uses the options$contrasts or what is set specifically
as the contrast for a factor using contrasts().  The default is
treatment contrasts and for a factor that means the first level of a factor
variable is the intercept and the remainder are "treatment" effects
which are added to intercept.  If you specify  Y~A+B+A:B, you are specifying
the model with main effects (A, B) and interactions (A:B).
So my interpretation of how R does internally when dealing with
interaction terms A:B is to look if each individual term (each sub
interactions in highway interactions like, A:B, A:C, etc., in A:B:C:D)
appears in the formula or not, and deciding how to construct the
columns of the model matrix according to A:B, right?

This seem quite complicated when a formula is arbitrarily complex, let
along different type of data (namely ordered, not ordered, numerical
numbers). The piece of information that I feel that is still missing
to me is the detailed procedure that R generate model.matrix for
arbitrary complex formulas and how it is proved the way that R does is
always correct for these complex formulas.

If A has m levels and B has n levels then there will be an intercept (1),
m-1 for A, n-1 for B and (m-1)*(n-1) for the interaction.
If you specify the model as Y~A:B it will specify the model fully with
interactions which will have m*n separate parameters and none will be
NA as long as you have data in each of the m*n cells.  It makes absolutely
no sense to me to have an intercept and then all but one of the
remaining interactions included.
I think that 'Y ~ A:B' indeed has the intercept term unless '-1' is
included. You may need to adjust your interpretation of A:B and A:B -1
in these cases.
dim(my_subset(model.matrix(Y ~ A:B - 1,fr)))
[1] 12 12
dim(my_subset(model.matrix(Y ~ A:B,fr)))
[1] 12 13


Note that you'll get quite different results if A and/or B are not factor
variables.

--jeff
blue sky wrote:
The following is the code for the model.matrix. But it still doesn't
answer why A:B is interpreted differently in Y~A+B+A:B and Y~A:B. By
'why', I mean how R internally does it and what is the rational behind
the way of doing it?

And it didn't answer why in the model.matrix of Y~A, there are a-1
terms from A plus the intercept, but in the model.matrix of Y~A:B,
there are a*b terms (rather than a*b-1 terms) plus the intercept.
Since the one coefficient of the lm of Y~A:B is going to be NA, why
bother to include the corresponding term in the model matrix?

############code below

set.seed(0)

a=3
b=4

AB_effect=data.frame(
  name=paste(
    unlist(
      do.call(
        rbind
        , rep(list(paste('A', letters[1:a],sep='')), b)
        )
      )
    , unlist(
      do.call(
        cbind
        , rep(list(paste('B', letters[1:b],sep='')), a)
        )
      )
    , sep=':'
    )
  , value=rnorm(a*b)
  , stringsAsFactors=F
  )

max_n=10
n=sample.int(max_n, a*b, replace=T)

AB=mapply(function(name, n){rep(name,n)}, AB_effect$name, n)

Y=AB_effect$value[match(unlist(AB), AB_effect$name)]

Y=Y+a*b*rnorm(length(Y))

sub_fr=as.data.frame(do.call(rbind, strsplit(unlist(AB), ':')))
rownames(sub_fr)=NULL
colnames(sub_fr)=c('A', 'B')

fr=data.frame(Y=Y,sub_fr)

my_subset=function(amm) {
  coding=apply(
    amm
    ,1
    , function(x) {
      paste(x, collapse='')
    }
    )
  amm[match(unique(coding), coding),]
}

my_subset(model.matrix(Y ~ A*B,fr))
my_subset(model.matrix(Y ~ (A+B)^2,fr))
my_subset(model.matrix(Y ~ A + B + A:B,fr))
my_subset(model.matrix(Y ~ A:B - 1,fr))
my_subset(model.matrix(Y ~ A:B,fr))

On Fri, Mar 5, 2010 at 8:45 AM, Gabor Grothendieck
<ggrothendi...@gmail.com>  wrote:
The way to understand this is to look at the output of model.matrix:

model.matrix(fo, fr)

for each formula you tried.  If your data is large you will have to
use a subset not to be overwhelmed with output.

On Fri, Mar 5, 2010 at 9:08 AM, blue sky<bluesky...@gmail.com>  wrote:
Suppose, 'fr' is data.frame with columns 'Y', 'A' and 'B'. 'A' has
levels 'Aa'
'Ab' and 'Ac', and 'B' has levels 'Ba', 'Bb', 'Bc' and 'Bd'. 'Y'
columns are numbers.

I tried the following three sets of commands. I understand that A*B is
equivalent to A+B+A:B. However, A:B in A+B+A:B is different from A:B
just by itself (see the 3rd and 4th set of commands). Would you please
help me understand why the meanings of A:B are different in different
contexts?

I also see the coefficient of AAc:BBd is NA (the last set of
commands). I'm wondering why this coefficient is not removed from the
'coefficients' vector. Since lm(Y~A) has coefficients for (intercept),
Ab, Ac (there are no NA's), I think that it is reasonable to make sure
that there are no NA's when there are interaction terms, namely, A:B
in this case.

Thank you for answering my questions!

alm=lm(Y ~ A*B,fr)
alm$coefficients
(Intercept)         AAb         AAc         BBb         BBc         BBd
  -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
  AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
  5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
alm=lm(Y ~ A + B + A:B,fr)
alm$coefficients
(Intercept)         AAb         AAc         BBb         BBc         BBd
  -3.548176   -2.086586    7.003857    4.367800   11.887356   -3.470840
  AAb:BBb     AAc:BBb     AAb:BBc     AAc:BBc     AAb:BBd     AAc:BBd
  5.160865  -11.858425  -12.853116  -20.289611    6.727401   -2.327173
alm=lm(Y ~ A:B - 1,fr)
alm$coefficients
  AAa:BBa    AAb:BBa    AAc:BBa    AAa:BBb    AAb:BBb    AAc:BBb
  AAa:BBc
-3.5481765 -5.6347625  3.4556808  0.8196231  3.8939016 -4.0349449
  8.3391795
  AAb:BBc    AAc:BBc    AAa:BBd    AAb:BBd    AAc:BBd
-6.6005222 -4.9465744 -7.0190168 -2.3782017 -2.3423322
alm=lm(Y ~ A:B,fr)
alm$coefficients
(Intercept)     AAa:BBa     AAb:BBa     AAc:BBa     AAa:BBb     AAb:BBb
-2.34233221 -1.20584424 -3.29243033  5.79801305  3.16195534  6.23623377
  AAc:BBb     AAa:BBc     AAb:BBc     AAc:BBc     AAa:BBd     AAb:BBd
-1.69261273 10.68151168 -4.25819000 -2.60424217 -4.67668454 -0.03586951
  AAc:BBd
       NA

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Reply via email to