Re: [R] Questions on factors in regression analysis

David Winsemius Thu, 20 Aug 2009 13:39:17 -0700


On Aug 20, 2009, at 4:07 PM, g...@ucalgary.ca wrote:


On Aug 20, 2009, at 3:42 PM, g...@ucalgary.ca wrote:

Thanks!


On Aug 20, 2009, at 1:46 PM, g...@ucalgary.ca wrote:

I have two questions on factors in regression:

Q1.
In a table, there are a few categorical/factor variables, a few numerical
variables, and the response variable is numeric. Some factors are important
but others are not.

How can I determine which categorical variables are significant to the
response variable?


Seems that you should engage the services of a consulting
statistician
for that sort of question. Or post in a venue where statistical
consulting is supposed to occur, such as one of the sci.stat.*
newsgroups.


I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
Are they good?


The quality of responses varies. You may get what you pay for. On the
other hand sometimes you get high-quality advice for free.

I have no idea how to do this. So any clue will be appreciated.


http://groups.google.com/?hl=en


Q2.
As we know, lm can deal with categorical variables.
I thought that, when there is a categorical predictor, we may use lm directly
without quantifying these factors, and that assigning different values to
factors would not change the fit, as shown:


The "numbers" that you are attempting to assign are really just labels
for the factor levels. The regression functions in R will not use them
for any calculations. They should not be thought of as having
"values". Even if the factor is an ordered factor, the labels may not
be interpretable as having the same numerical order as the string
values might suggest.


x <- 1:20 ## numeric predictor
yes.no <- c("yes","no")
factors <- gl(2,10,20,yes.no) ## factor predictor
factors.quant <- rep(c(18.8,29.9),c(10,10)) ## quantification of factors


Not sure what that is supposed to mean. It is not a factor object, even
though you may be misleading yourself into believing it should be.
It's a numeric vector.


Yes, levels are not numeric but just labels. But after the factor levels
were assigned numeric values, as in factors.quant and factors.quant.1,

lm(response ~ x + factors.quant) and lm(response ~ x + factors.quant.1)

produced the same fitted curve as lm(response ~ x + factors). This is what
I could not understand.


In both the factor variable case and the numeric variable case there
was no variation in the predictor variable within a level. So the
predictions will all be the same within levels in each case. There
will be differences in the coefficients arrived at to achieve that
result, however.
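
A minimal sketch of why the fits coincide, reusing the variable names from
the quoted example. The seed is an assumption added only for reproducibility,
so the numbers will not match the output quoted in this thread. With only two
levels, the 0/1 dummy that lm builds for the factor is a linear rescaling of
either numeric coding, so the fitted values agree and only the coefficients
change scale:

```r
set.seed(1)  # assumed seed, not part of the original example

x <- 1:20
factors <- gl(2, 10, 20, c("yes", "no"))           # two-level factor
factors.quant   <- rep(c(18.8, 29.9), c(10, 10))   # first numeric coding
factors.quant.1 <- rep(c(16.9, 38.9), c(10, 10))   # second numeric coding
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

lm.fact    <- lm(response ~ x + factors)
lm.quant   <- lm(response ~ x + factors.quant)
lm.quant.1 <- lm(response ~ x + factors.quant.1)

# Fitted values agree across all three models (up to rounding error)
all.equal(fitted(lm.fact), fitted(lm.quant))
all.equal(fitted(lm.fact), fitted(lm.quant.1))

# The dummy coefficient equals the numeric slope times the gap between codes
coef(lm.fact)["factorsno"]
coef(lm.quant)["factors.quant"] * (29.9 - 18.8)
coef(lm.quant.1)["factors.quant.1"] * (38.9 - 16.9)
```

The same relation can be checked against the coefficients quoted further
down in this thread: 1.2350 * 11.1 and 0.6231 * 22 are both about 13.71,
the factorsno coefficient.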


I even tried

cor(response, factors)

[1] 0.968241

cor(response, factors.quant)

[1] 0.968241

cor(response, factors.quant.1)

[1] 0.968241

If assigning values to factors does not change curve-fitting,
one may use factors.quant to do regression analysis if he wants to
find the curve patterns.
The coefficients are different since they use different predictors.
If they are the same, then the curves fitted are different.

Try setting up with 3 factor levels and three discrete values for the
numeric predictor. The cor() function will continue to give meaningful
results for the numeric variable but not for the factor variable. The
interpretation of the coefficients from a model with three level
factors may require further study on your part.
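
One way to set up that three-level experiment is sketched below. The names
f3 and f3.quant, the level labels, the group effects and the seed are all
made up for illustration. With three levels the factor model estimates a
separate offset for each non-reference level, while the numeric coding
forces the offsets onto a straight line in the codes, so the two fits no
longer coincide in general:

```r
set.seed(2)  # assumed seed, for reproducibility only

x  <- 1:30
f3 <- gl(3, 10, 30, c("low", "mid", "high"))   # hypothetical three-level factor
f3.quant <- rep(c(1, 2, 3), each = 10)         # hypothetical numeric coding

# Group effects 0, 10, 2 are deliberately not linear in the codes 1, 2, 3
response <- 0.8 * x + c(0, 10, 2)[f3] + rnorm(30)

fit.fact  <- lm(response ~ x + f3)        # intercept, x, and two dummies
fit.quant <- lm(response ~ x + f3.quant)  # intercept, x, and one slope

length(coef(fit.fact))    # 4
length(coef(fit.quant))   # 3

# Unlike the two-level case, the fitted values differ
isTRUE(all.equal(fitted(fit.fact), fitted(fit.quant)))

# The factor model fits at least as well, since it has one more free parameter
c(factor = deviance(fit.fact), numeric = deviance(fit.quant))
```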


Can I rank factors.1 and factors.2 using
cor(response, factors.1) and cor(response, factors.2)?
Thanks,

str(factors.quant)

num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...

factors.quant.1 <- rep(c(16.9,38.9),c(10,10)) ## second quantification of factors
response <- 0.8*x + 18 + factors.quant + rnorm(20) ## response
lm.quant <- lm(response ~ x + factors.quant) ## lm with quantifications
lm.fact <- lm(response ~ x + factors) ## lm with factors

lm.quant


Call:
lm(formula = response ~ x + factors.quant)

Coefficients:
 (Intercept)              x  factors.quant
     14.9098         0.5385         1.2350

lm.fact


Call:
lm(formula = response ~ x + factors)

Coefficients:
(Intercept)            x    factorsno
   38.1286       0.5385      13.7090


lm.quant.1 <- lm(response ~ x + factors.quant.1) ## lm with quantifications

lm.quant.1


Call:
lm(formula = response ~ x + factors.quant.1)

Coefficients:
   (Intercept)                x  factors.quant.1
       27.5976           0.5385           0.6231

lm.fact.1 <- lm(response ~ x + factors) ##lm with factors

par(mfrow=c(2,2)) ## comparisons of two fittings
plot(x, response)
lines(x,fitted(lm.quant),col="blue")
grid()
plot(x,response)
lines(x,fitted(lm.fact),col = "red")
grid()
plot(x, response)
lines(x,fitted(lm.quant.1),lty =2,col="blue")
grid()
plot(x,response)
lines(x,fitted(lm.fact.1),lty =2,col = "red")
grid()
par(mfrow = c(1,1))

So, is it right that we can assign any numeric values to factors,
for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the above,
before doing lm, glm, aov, even nls?

You can give factor levels any name you like, including any sequence
of digit characters. Unlike "ordinary" R, where unquoted numbers cannot
start variable names, factor functions will coerce numeric vectors to
character vectors when assigning level names. But you seem to be
conflating factors with numeric vectors that have many ties. Those two
entities would have different handling by R's regression functions.
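
A short sketch of that distinction, using the illustrative names v and f
(not from the original example). It shows that a factor built from numbers
carries character labels, and that the model formula machinery expands the
two objects differently:

```r
v <- rep(c(18.8, 29.9), c(10, 10))   # numeric vector with many ties
f <- factor(v)                       # factor whose labels look like numbers

is.numeric(v)   # TRUE
is.factor(f)    # TRUE
levels(f)       # "18.8" "29.9" (character labels, not numbers)

# A numeric predictor enters the design matrix as a single column of its values
colnames(model.matrix(~ v))   # "(Intercept)" "v"

# A factor enters as 0/1 indicator columns, one per non-reference level
colnames(model.matrix(~ f))   # "(Intercept)" "f29.9"

# The internal integer codes are 1 and 2, not the labels
unique(as.integer(f))

# Recovering the numbers requires going through the labels explicitly
unique(as.numeric(as.character(f)))
```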

--


David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
