> Functions like lm() treat logical predictors as factors, *not* as numerical variables.
Not quite. A factor with all elements the same causes lm() to give an error,
while a logical that is all TRUE or all FALSE is simply dropped from the model
(it gets a coefficient of NA). This is a fairly common situation when you fit
models to subsets of a big data.frame. This is an argument for fixing the
single-valued-factor problem, which would become more noticeable if logicals
were treated as factors.

> d <- data.frame(Age=c(2,4,6,8,10), Weight=c(878, 890, 930, 800, 750), Diseased=c(FALSE,FALSE,FALSE,TRUE,TRUE))
> coef(lm(data=d, Weight ~ Age + Diseased))
 (Intercept)          Age DiseasedTRUE
    877.7333       5.4000    -151.3333
> coef(lm(data=d, Weight ~ Age + factor(Diseased)))
         (Intercept)                  Age factor(Diseased)TRUE
            877.7333               5.4000            -151.3333
> coef(lm(data=d, Weight ~ Age + Diseased, subset=Age<7))
 (Intercept)          Age DiseasedTRUE
    847.3333      13.0000           NA
> coef(lm(data=d, Weight ~ Age + factor(Diseased), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels
> coef(lm(data=d, Weight ~ Age + factor(Diseased, levels=c(FALSE,TRUE)), subset=Age<7))
Error in `contrasts<-`(`*tmp*`, value = contr.funs[1 + isOF[nn]]) :
  contrasts can be applied only to factors with 2 or more levels

Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Sat, Aug 31, 2019 at 8:54 AM Fox, John <j...@mcmaster.ca> wrote:

> Dear Abby,
>
> > On Aug 30, 2019, at 8:20 PM, Abby Spurdle <spurdl...@gmail.com> wrote:
> >
> >> I think that it would be better to handle factors, character
> >> predictors, and logical predictors consistently.
> >
> > "logical predictors" can be regarded as categorical or continuous (i.e.
> > 0 or 1).
> > And the model matrix should be the same, either way.
>
> I think that you're mistaking a coincidence for a principle. The
> coincidence is that FALSE/TRUE coerces to 0/1 and sorts to FALSE, TRUE.
> Functions like lm() treat logical predictors as factors, *not* as numerical
> variables.
>
> That one would get the same coefficient in either case is a consequence of
> the coincidence and the fact that the default contrasts for unordered
> factors are contr.treatment(). For example, if you changed the contrasts
> option, you'd get a different estimate (though of course a model with the
> same fit to the data and an equivalent interpretation):
>
> ------------ snip --------------
>
> > options(contrasts=c("contr.sum", "contr.poly"))
> > m3 <- lm(Sepal.Length ~ Sepal.Width + I(Species == "setosa"), data=iris)
> > m3
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + I(Species == "setosa"),
>     data = iris)
>
> Coefficients:
>             (Intercept)              Sepal.Width  I(Species == "setosa")1
>                  2.6672                   0.9418                   0.8898
>
> > head(model.matrix(m3))
>   (Intercept) Sepal.Width I(Species == "setosa")1
> 1           1         3.5                      -1
> 2           1         3.0                      -1
> 3           1         3.2                      -1
> 4           1         3.1                      -1
> 5           1         3.6                      -1
> 6           1         3.9                      -1
> > tail(model.matrix(m3))
>     (Intercept) Sepal.Width I(Species == "setosa")1
> 145           1         3.3                       1
> 146           1         3.0                       1
> 147           1         2.5                       1
> 148           1         3.0                       1
> 149           1         3.4                       1
> 150           1         3.0                       1
>
> > lm(Sepal.Length ~ Sepal.Width + as.numeric(Species == "setosa"),
> data=iris)
>
> Call:
> lm(formula = Sepal.Length ~ Sepal.Width + as.numeric(Species ==
>     "setosa"), data = iris)
>
> Coefficients:
>                     (Intercept)                      Sepal.Width
>                          3.5571                           0.9418
> as.numeric(Species == "setosa")
>                         -1.7797
>
> > -2*coef(m3)[3]
> I(Species == "setosa")1
>               -1.779657
>
> ------------ snip --------------
>
> > I think the first question to be asked is, which is the best approach,
> > categorical or continuous?
> > The continuous approach seems simpler and more efficient to me, but
> > output from the categorical approach may be more intuitive, for some
> > people.
>
> I think that this misses the point I was trying to make: lm() et al. treat
> logical variables as factors, not as numerical predictors. One could argue
> about what's the better approach but not about what lm() does.
> BTW, I prefer treating a logical predictor as a factor because the
> predictor is essentially categorical.
>
> > I note that the use of factors and characters doesn't necessarily
> > produce consistent output for $xlevels
> > (because factors can have their levels re-ordered).
>
> Again, this misses the point: Both factors and character predictors
> produce elements in $xlevels; logical predictors do not, even though they
> are treated in the model as factors. That factors have levels that aren't
> necessarily ordered alphabetically is a reason that I prefer using factors
> to using character predictors, but this has nothing to do with the point I
> was trying to make about $xlevels.
>
> Best,
> John
>
> -------------------------------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: http::/socserv.mcmaster.ca/jfox
>
> ______________________________________________
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
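
The two points argued in this thread can be checked directly, so here is a minimal sketch that reuses the toy data frame from Bill Dunlap's message. It inspects the $contrasts and $xlevels components of the fitted models, and then shows one possible way to sidestep the single-valued-factor error when fitting to a subset. The helper varying_predictors() is hypothetical (it is not part of R or of any package mentioned in the thread), and what $xlevels holds for a logical predictor may depend on the R version in use.

```r
# Same toy data as in Bill Dunlap's message.
d <- data.frame(Age = c(2, 4, 6, 8, 10),
                Weight = c(878, 890, 930, 800, 750),
                Diseased = c(FALSE, FALSE, FALSE, TRUE, TRUE))

fit_lgl <- lm(Weight ~ Age + Diseased, data = d)
fit_fac <- lm(Weight ~ Age + factor(Diseased), data = d)

# The logical predictor is assigned contrasts, as a factor would be ...
fit_lgl$contrasts
# ... but, as John Fox points out, only the explicit factor is expected to
# show up in $xlevels.
fit_lgl$xlevels
fit_fac$xlevels

# Hypothetical helper (an assumption for illustration, not an R function):
# keep only the predictors that take at least two distinct values in the
# rows that are about to be fitted.
varying_predictors <- function(data, predictors) {
  keep <- vapply(predictors,
                 function(p) length(unique(data[[p]])) > 1L,
                 logical(1))
  predictors[keep]
}

# In this subset Diseased is constant; as a factor it would make lm() fail.
young <- d[d$Age < 7, ]
young$Diseased <- factor(young$Diseased)

preds <- varying_predictors(young, c("Age", "Diseased"))
fit_young <- lm(reformulate(preds, response = "Weight"), data = young)
coef(fit_young)
```

Dropping the constant predictor up front gives the same intercept and Age slope as the logical-predictor fit in the original message, only without the NA placeholder. Whether silently dropping a term is acceptable depends on the analysis, so a warning inside the helper may be preferable in real use.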