Dear Abby,

> On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdl...@gmail.com> wrote:
> 
>> I think that it would be better to handle factors, character predictors, and
>> logical predictors consistently.
> 
> "logical predictors" can be regarded as categorical or continuous (i.e. 0 or 1).
> And the model matrix should be the same, either way.

I think that you're mistaking a coincidence for a principle. The coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE. Functions like lm() treat logical predictors as factors, *not* as numerical variables. That one would get the same coefficient in either case is a consequence of the coincidence and the fact that the default contrasts for unordered factors are contr.treatment(). For example, if you changed the contrasts option, you'd get a different estimate (though of course a model with the same fit to the data and an equivalent interpretation):

------------ snip --------------

> options(contrasts=c("contr.sum", "contr.poly"))
> m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> m3

Call:
lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data = iris)

Coefficients:
            (Intercept)              Sepal.Width  I(Species == "setosa")1
                 2.6672                   0.9418                   0.8898

> head(model.matrix(m3))
  (Intercept) Sepal.Width I(Species == "setosa")1
1           1         3.5                      -1
2           1         3.0                      -1
3           1         3.2                      -1
4           1         3.1                      -1
5           1         3.6                      -1
6           1         3.9                      -1

> tail(model.matrix(m3))
    (Intercept) Sepal.Width I(Species == "setosa")1
145           1         3.3                       1
146           1         3.0                       1
147           1         2.5                       1
148           1         3.0                       1
149           1         3.4                       1
150           1         3.0                       1

> lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data=iris)

Call:
lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"), data = iris)

Coefficients:
                    (Intercept)                      Sepal.Width  as.numeric(Species == "setosa")
                         3.5571                           0.9418                          -1.7797

> -2*coef(m3)[3]
I(Species == "setosa")1
              -1.779657

------------ snip --------------

> 
> I think the first question to be asked is, which is the best approach,
> categorical or continuous?
> The continuous approach seems simpler and more efficient to me, but
> output from the categorical approach may be more intuitive, for some
> people.

I think that this misses the point I was trying to make: lm() et al. treat logical variables as factors, not as numerical predictors.
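
One way to check this directly, as a minimal sketch: the "contrasts" attribute of the model matrix lists only the terms that were coded as factors. The object names m.lgl and m.num are made up for illustration, and the default contrasts are set explicitly so that the sketch does not depend on the options() call in the transcript above.

```r
# Use the default contrasts for this check, whatever was set earlier
op <- options(contrasts = c("contr.treatment", "contr.poly"))

# Same model twice: logical predictor vs. its explicit numeric coercion
m.lgl <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data = iris)
m.num <- lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
            data = iris)

# Only terms coded as factors appear in the "contrasts" attribute:
# the logical term is listed, the numeric one is not
names(attr(model.matrix(m.lgl), "contrasts"))
names(attr(model.matrix(m.num), "contrasts"))

# The logical term also gets a factor-style column name (term plus level)
colnames(model.matrix(m.lgl))
colnames(model.matrix(m.num))

# Under contr.treatment the two sets of estimates coincide
all.equal(unname(coef(m.lgl)), unname(coef(m.num)))

# Restore the previous contrasts option
options(op)
```

The agreement in the last comparison is the coincidence described above: it holds under contr.treatment() but not, for example, under contr.sum().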
One could argue about what's the better approach but not about what lm() does. BTW, I prefer treating a logical predictor as a factor because the predictor is essentially categorical.

> 
> I note that the use factors and characters, doesn't necessarily
> produce consistent output, for $xlevels.
> (Because factors can have their levels re-ordered).

Again, this misses the point: Both factors and character predictors produce elements in $xlevels; logical predictors do not, even though they are treated in the model as factors. That factors have levels that aren't necessarily ordered alphabetically is a reason that I prefer using factors to using character predictors, but this has nothing to do with the point I was trying to make about $xlevels.

Best,
 John

-------------------------------------------------
John Fox, Professor Emeritus
McMaster University
Hamilton, Ontario, Canada
Web: http://socserv.mcmaster.ca/jfox

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
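
To illustrate the $xlevels point made above, here is a minimal sketch on made-up data. The data frame d and its columns f, ch, and lg are hypothetical, and the result assumes that the behaviour described in this message still holds in the R version at hand.

```r
# Hypothetical toy data: one factor, one character, one logical predictor
set.seed(1)
d <- data.frame(
  y  = rnorm(12),
  f  = factor(rep(c("a", "b", "c"), 4)),
  ch = rep(c("u", "v"), 6),
  lg = rep(c(TRUE, FALSE, FALSE, TRUE), 3),
  stringsAsFactors = FALSE
)

m <- lm(y ~ f + ch + lg, data = d)

# Levels are stored for the factor and the character predictor ...
names(m$xlevels)

# ... while the contrasts component shows which terms were coded as factors,
# and that includes the logical predictor
names(m$contrasts)
```

Comparing the two sets of names shows the inconsistency: lg has contrasts, like f and ch, but no entry in $xlevels.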