You are correct that you need to use ~-1+A:B. I use that all the time
and just spaced it out when I was writing
the response. Using ~A:B will produce one too many columns.
Didn't follow your other question. You can always look at the result of
model.matrix to see if it is correct or dig into the
internal code. Personally, I've never found it to have a problem.
--jeff
On 3/5/2010 10:16 AM, blue sky wrote:
On Fri, Mar 5, 2010 at 11:41 AM, Jeff Laake<jeff.la...@noaa.gov> wrote:
On 3/5/2010 9:19 AM, Frank E Harrell Jr wrote:
You neglected to state your name and affiliation, and your question
demonstrates an allergy to R documentation.
Frank
I agree with Frank but will try to answer some of your questions as I
understand it.
First, model.matrix uses the options$contrasts or what is set specifically
as the contrast for a factor using contrasts(). The default is
treatment contrasts and for a factor that means the first level of a factor
variable is the intercept and the remainder are "treatment" effects
which are added to intercept. If you specify Y~A+B+A:B, you are specifying
the model with main effects (A, B) and interactions (A:B).
So my interpretation of how R does internally when dealing with
interaction terms A:B is to look if each individual term (each sub
interactions in highway interactions like, A:B, A:C, etc., in A:B:C:D)
appears in the formula or not, and deciding how to construct the
columns of the model matrix according to A:B, right?
This seem quite complicated when a formula is arbitrarily complex, let
along different type of data (namely ordered, not ordered, numerical
numbers). The piece of information that I feel that is still missing
to me is the detailed procedure that R generate model.matrix for
arbitrary complex formulas and how it is proved the way that R does is
always correct for these complex formulas.
If A has m levels and B has n levels then there will be an intercept (1),
m-1 for A, n-1 for B and (m-1)*(n-1) for the interaction.
If you specify the model as Y~A:B it will specify the model fully with
interactions which will have m*n separate parameters and none will be
NA as long as you have data in each of the m*n cells. It makes absolutely
no sense to me to have an intercept and then all but one of the
remaining interactions included.
I think that 'Y ~ A:B' indeed has the intercept term unless '-1' is
included. You may need to adjust your interpretation of A:B and A:B -1
in these cases.
dim(my_subset(model.matrix(Y ~ A:B - 1,fr)))
[1] 12 12
dim(my_subset(model.matrix(Y ~ A:B,fr)))
[1] 12 13
Note that you'll get quite different results if A and/or B are not factor
variables.
--jeff
blue sky wrote:
The following is the code for the model.matrix. But it still doesn't
answer why A:B is interpreted differently in Y~A+B+A:B and Y~A:B. By
'why', I mean how R internally does it and what is the rational behind
the way of doing it?
And it didn't answer why in the model.matrix of Y~A, there are a-1
terms from A plus the intercept, but in the model.matrix of Y~A:B,
there are a*b terms (rather than a*b-1 terms) plus the intercept.
Since the one coefficient of the lm of Y~A:B is going to be NA, why
bother to include the corresponding term in the model matrix?
############code below
set.seed(0)
a=3
b=4
AB_effect=data.frame(
name=paste(
unlist(
do.call(
rbind
, rep(list(paste('A', letters[1:a],sep='')), b)
)
)
, unlist(
do.call(
cbind
, rep(list(paste('B', letters[1:b],sep='')), a)
)
)
, sep=':'
)
, value=rnorm(a*b)
, stringsAsFactors=F
)
max_n=10
n=sample.int(max_n, a*b, replace=T)
AB=mapply(function(name, n){rep(name,n)}, AB_effect$name, n)
Y=AB_effect$value[match(unlist(AB), AB_effect$name)]
Y=Y+a*b*rnorm(length(Y))
sub_fr=as.data.frame(do.call(rbind, strsplit(unlist(AB), ':')))
rownames(sub_fr)=NULL
colnames(sub_fr)=c('A', 'B')
fr=data.frame(Y=Y,sub_fr)
my_subset=function(amm) {
coding=apply(
amm
,1
, function(x) {
paste(x, collapse='')
}
)
amm[match(unique(coding), coding),]
}
my_subset(model.matrix(Y ~ A*B,fr))
my_subset(model.matrix(Y ~ (A+B)^2,fr))
my_subset(model.matrix(Y ~ A + B + A:B,fr))
my_subset(model.matrix(Y ~ A:B - 1,fr))
my_subset(model.matrix(Y ~ A:B,fr))
On Fri, Mar 5, 2010 at 8:45 AM, Gabor Grothendieck
<ggrothendi...@gmail.com> wrote:
The way to understand this is to look at the output of model.matrix:
model.matrix(fo, fr)
for each formula you tried. If your data is large you will have to
use a subset not to be overwhelmed with output.
On Fri, Mar 5, 2010 at 9:08 AM, blue sky<bluesky...@gmail.com> wrote:
Suppose, 'fr' is data.frame with columns 'Y', 'A' and 'B'. 'A' has
levels 'Aa'
'Ab' and 'Ac', and 'B' has levels 'Ba', 'Bb', 'Bc' and 'Bd'. 'Y'
columns are numbers.
I tried the following three sets of commands. I understand that A*B is
equivalent to A+B+A:B. However, A:B in A+B+A:B is different from A:B
just by itself (see the 3rd and 4th set of commands). Would you please
help me understand why the meanings of A:B are different in different
contexts?
I also see the coefficient of AAc:BBd is NA (the last set of
commands). I'm wondering why this coefficient is not removed from the
'coefficients' vector. Since lm(Y~A) has coefficients for (intercept),
Ab, Ac (there are no NA's), I think that it is reasonable to make sure
that there are no NA's when there are interaction terms, namely, A:B
in this case.
Thank you for answering my questions!
alm=lm(Y ~ A*B,fr)
alm$coefficients
(Intercept) AAb AAc BBb BBc BBd
-3.548176 -2.086586 7.003857 4.367800 11.887356 -3.470840
AAb:BBb AAc:BBb AAb:BBc AAc:BBc AAb:BBd AAc:BBd
5.160865 -11.858425 -12.853116 -20.289611 6.727401 -2.327173
alm=lm(Y ~ A + B + A:B,fr)
alm$coefficients
(Intercept) AAb AAc BBb BBc BBd
-3.548176 -2.086586 7.003857 4.367800 11.887356 -3.470840
AAb:BBb AAc:BBb AAb:BBc AAc:BBc AAb:BBd AAc:BBd
5.160865 -11.858425 -12.853116 -20.289611 6.727401 -2.327173
alm=lm(Y ~ A:B - 1,fr)
alm$coefficients
AAa:BBa AAb:BBa AAc:BBa AAa:BBb AAb:BBb AAc:BBb
AAa:BBc
-3.5481765 -5.6347625 3.4556808 0.8196231 3.8939016 -4.0349449
8.3391795
AAb:BBc AAc:BBc AAa:BBd AAb:BBd AAc:BBd
-6.6005222 -4.9465744 -7.0190168 -2.3782017 -2.3423322
alm=lm(Y ~ A:B,fr)
alm$coefficients
(Intercept) AAa:BBa AAb:BBa AAc:BBa AAa:BBb AAb:BBb
-2.34233221 -1.20584424 -3.29243033 5.79801305 3.16195534 6.23623377
AAc:BBb AAa:BBc AAb:BBc AAc:BBc AAa:BBd AAb:BBd
-1.69261273 10.68151168 -4.25819000 -2.60424217 -4.67668454 -0.03586951
AAc:BBd
NA
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.