Hi Ben,

The following is oversimplified but hopefully helpful.

Regression only works with numbers, so the trick becomes how to convert non-numeric data into meaningful numbers. For so-called continuous data (the type you get from running rnorm(100)), nothing needs to be done. Other data (e.g., what you get from sample(1:5, 100, replace = TRUE)) may not be truly continuous but are often treated as such. This type is particularly common in the social sciences, where questionnaires and surveys are administered and participants are asked to rate things on 1 to 5, 1 to 7, or ... scales.
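For example (a quick sketch with simulated data; the variable names are made up purely for illustration):

x_cont   <- rnorm(100)                        # truly continuous
x_rating <- sample(1:5, 100, replace = TRUE)  # 1-5 rating, often treated as continuous
y        <- rnorm(100)                        # made-up outcome
lm(y ~ x_cont + x_rating)                     # both enter the model as-is, no conversion needed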
When you move on to data that are not really continuous and that you do not want to treat as such (say first, second, third place), some scheme has to be used to convert them. Most commonly, contrasts are used: certain levels are contrasted with others. In R, for ordered factors, the default contrasts are orthogonal polynomials. For example, the contrasts for the first, second, third example can be seen with contrasts(factor(1:3, ordered = TRUE)). The .L and .Q columns stand for linear and quadratic, respectively. For k levels, there will be k - 1 contrast columns. This relaxes the linearity assumption applied to continuous data by testing the effects of first, second, etc. order polynomials.

If the data have no meaningful order, say explaining levels of Red Bull consumption by college major, the default contrasts applied by R are "dummy codes". This picks one group (the lowest level) as the referent and compares the effect of each of the other groups relative to the referent. For example, suppose we had a small sample of only three college majors: in contrasts(factor(1:3)), level 1 is the reference group, the first contrast tests the effect of being in group 2 versus group 1, and the second the effect of group 3 versus group 1.

All of these work with logistic regression, or any flavour of generalized linear model (via glm() and other functions). In many regards, the treatment of predictors in logistic regression is no different from basic linear regression (ordinary least squares [OLS]): the logistic (link) function works on the outcome, not the predictors. That said, some special considerations do come into play. You need some variability on all of your predictors. In OLS with a truly continuous outcome, if you have a two-level nominal predictor with some people in each level, it is unlikely that any given cell would have all the same outcome values. However, with a 0/1 outcome and a 0/1 predictor, it may be that in one particular cell everyone has either a 0 or a 1 for the outcome, which can be problematic for estimation purposes.

What sorts of data are you dealing with? Is just entering the variables or using factor() not doing what you expect with some? I have not looked at the web page you referenced much, but if you have an example type of data you feel is not covered or would like more fully covered, feel free to email me off list and I can add an example to the page.

Cheers,

Josh
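To make that concrete, here is a minimal sketch with simulated data; the variable names, factor levels, and majors are made up for illustration, not taken from your data:

set.seed(1)
mydata <- data.frame(
  y      = rbinom(100, 1, 0.5),               # 0/1 outcome
  x_cont = rnorm(100),                        # truly continuous predictor
  x_lik  = sample(1:5, 100, replace = TRUE),  # 1-5 rating treated as continuous
  x_ord  = ordered(sample(1:3, 100, replace = TRUE),
                   labels = c("first", "second", "third")),               # ordered factor
  x_nom  = factor(sample(c("bio", "chem", "econ"), 100, replace = TRUE))  # nominal factor
)

contrasts(mydata$x_ord)  # orthogonal polynomial contrasts: .L (linear) and .Q (quadratic)
contrasts(mydata$x_nom)  # dummy codes, with the first level ("bio") as the referent

fit <- glm(y ~ x_cont + x_lik + x_ord + x_nom,
           data = mydata, family = binomial(link = "logit"))
summary(fit)

In the summary, x_ord shows up as its .L and .Q contrasts and x_nom as two dummy-coded terms, which is the same thing that happens if you wrap a variable in factor() directly inside the formula. (Note also that ordered = TRUE is an argument to factor() or ordered(), not to as.factor().)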
On Fri, Nov 25, 2011 at 2:09 PM, Ben quant <ccqu...@gmail.com> wrote:
> Hello,
>
> Is there an example out there that shows how to treat each of the predictor
> variable types when doing logistic regression in R? Something like this:
>
> glm(y~x1+x2+x3+x4, data=mydata, family=binomial(link="logit"),
> na.action=na.pass)
>
> I'm drawing mostly from:
> http://www.ats.ucla.edu/stat/r/dae/logit.htm
>
> ...but there are only two types of variable in the example given. I'm
> wondering if the answer is that easy or if I have to consider more with
> different types of variables. It seems like as.factor() is doing a lot of
> the organization for me.
>
> I will need to understand how to perform logistic regression in R on all
> data types all in the same model (potentially).
>
> As it stands, I think I can solve all of my data type issues with:
>
> as.factor(x, ordered=T) ...for all discrete ordinal variables
> as.factor(x, ordered=F) ...for all discrete nominal variables
> ...and do nothing for everything else.
>
> I'm pretty sure it's not that simple because of some other posts I've seen,
> but I haven't seen a post that discusses ALL data types in logistic
> regression.
>
> Here is what I think will work at this point:
>
> glm(y ~ **all_other_vars + as.factor(disc_ord_var, ordered=T) +
> as.factor(disc_nom_var, ordered=F), data=mydata,
> family=binomial(link="logit"), na.action=na.pass)
>
> I'm also looking for any best practices help as well. I'm new'ish to
> R... and oddly enough I haven't had the pleasure of doing much regression in R
> yet.
>
> Regards,
>
> Ben

--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, ATS Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.