> On Aug 20, 2009, at 4:07 PM, g...@ucalgary.ca wrote:
>
>>> On Aug 20, 2009, at 3:42 PM, g...@ucalgary.ca wrote:
>>>
>>>> Thanks!
>>>>>
>>>>> On Aug 20, 2009, at 1:46 PM, g...@ucalgary.ca wrote:
>>>>>
>>>>>> I got two questions on factors in regression:
>>>>>>
>>>>>> Q1.
>>>>>> In a table, there are a few categorical/factor variables, a few
>>>>>> numerical variables, and the response variable is numeric. Some
>>>>>> factors are important but others are not.
>>>>>> How to determine which categorical variables are significant to
>>>>>> the response variable?
>>>>>
>>>>> Seems that you should engage the services of a consulting
>>>>> statistician for that sort of question. Or post in a venue where
>>>>> statistical consulting is supposed to occur, such as one of the
>>>>> sci.stat.* newsgroups.
>>>>
>>>> I googled sci.stat.* and got sci.stat.math and sci.stat.consult.
>>>> Are they good?
>>>
>>> The quality of responses varies. You may get what you pay for. On the
>>> other hand sometimes you get high-quality advice for free.
>>>
>>>> I have no idea how to do this. So any clue will be appreciated.
>>>
>>> http://groups.google.com/?hl=en
>>>
>>>>>> Q2.
>>>>>> As we know, lm can deal with categorical variables.
>>>>>> I thought, when there is a categorical predictor, we may use lm
>>>>>> directly without quantifying these factors, and assigning
>>>>>> different values to factors would not change the fittings as shown:
>>>>>
>>>>> The "numbers" that you are attempting to assign are really just
>>>>> labels for the factor levels. The regression functions in R will
>>>>> not use them for any calculations. They should not be thought of as
>>>>> having "values". Even if the factor is an ordered factor, the
>>>>> labels may not be interpretable as having the same numerical order
>>>>> as the string values might suggest.
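A short sketch of the point about labels, using a made-up factor `f` that is not from the thread: the level labels are stored as character strings, and the model matrix that lm() builds from a factor holds only 0/1 indicator columns, so the label text should never enter the arithmetic.

```r
# Hypothetical factor whose level labels happen to look like numbers.
f <- factor(c("18.8", "29.9", "18.8", "29.9"))

levels(f)                     # the labels, stored as character strings
as.integer(f)                 # internal integer codes, not the label values
as.numeric(as.character(f))   # explicit conversion back to the numbers

# Under the default treatment contrasts, a factor is expanded into
# indicator (dummy) columns; the label only names the column.
mm <- model.matrix(~ f)
colnames(mm)
mm[, "f29.9"]                 # 0/1 indicator for the second level
```

Renaming the levels would rename the indicator column but leave its 0/1 entries, and therefore the fit, unchanged.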
>>>>>
>>>>>> x <- 1:20 ## numeric predictor
>>>>>> yes.no <- c("yes","no")
>>>>>> factors <- gl(2,10,20,yes.no) ## factor predictor
>>>>>> factors.quant <- rep(c(18.8,29.9),c(10,10)) ## quantification of factors
>>>>>
>>>>> Not sure what that is supposed to mean. It is not a factor object,
>>>>> even though you may be misleading yourself into believing it should
>>>>> be. It's a numeric vector.
>>>>
>>>> Yes, levels are not numeric but just labels. But after the levels of
>>>> factors were assigned numeric values as factors.quant and
>>>> factors.quant.1, lm(response ~ x + factors.quant) and
>>>> lm(response ~ x + factors.quant.1) produced the same fitted curve as
>>>> lm(response ~ x + factors). This is what I could not understand.
>>>
>>> In both the factor variable case and the numeric variable case there
>>> was no variation in the predictor variable within a level. So the
>>> predictions will all be the same within levels in each case. There
>>> will be differences in the coefficients arrived at to achieve that
>>> result, however.
>>
>> I even tried
>>
>> > cor(response, factors)
>> [1] 0.968241
>> > cor(response, factors.quant)
>> [1] 0.968241
>> > cor(response, factors.quant.1)
>> [1] 0.968241
>>
>> If assigning values to factors does not change curve-fitting,
>> one may use factors.quant to do regression analysis if he wants to
>> find the curve patterns.
>> The coefficients are different since they use different predictors.
>> If they are the same, then the curves fitted are different.
>
> Try setting up with 3 factor levels and three discrete values for the
> numeric predictor. The cor() function will continue to give meaningful
> results for the numeric variable but not for the factor variable. The
> interpretation of the coefficients from a model with three-level
> factors may require further study on your part.
>
Yes, when the number of levels is greater than or equal to 3, that is not true.
Thanks,
>
>>
>> Can I rank factors.1 and factors.2 using
>> cor(response, factors.1) and cor(response, factors.2)?
>> Thanks,
>>>
>>>>
>>>>> > str(factors.quant)
>>>>> num [1:20] 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 18.8 ...
>>>>>
>>>>>> factors.quant.1 <- rep(c(16.9,38.9),c(10,10)) ## second quantification of factors
>>>>>> response <- 0.8*x + 18 + factors.quant + rnorm(20) ## response
>>>>>> lm.quant <- lm(response ~ x + factors.quant) ## lm with quantifications
>>>>>> lm.fact <- lm(response ~ x + factors) ## lm with factors
>>>>>
>>>>> > lm.quant
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors.quant)
>>>>>
>>>>> Coefficients:
>>>>>   (Intercept)              x  factors.quant
>>>>>       14.9098         0.5385         1.2350
>>>>>
>>>>> > lm.fact
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors)
>>>>>
>>>>> Coefficients:
>>>>> (Intercept)            x    factorsno
>>>>>     38.1286       0.5385      13.7090
>>>>>
>>>>>> lm.quant.1 <- lm(response ~ x + factors.quant.1) ## lm with quantifications
>>>>>
>>>>> > lm.quant.1
>>>>>
>>>>> Call:
>>>>> lm(formula = response ~ x + factors.quant.1)
>>>>>
>>>>> Coefficients:
>>>>>     (Intercept)                x  factors.quant.1
>>>>>         27.5976           0.5385           0.6231
>>>>>
>>>>>> lm.fact.1 <- lm(response ~ x + factors) ## lm with factors
>>>>>>
>>>>>> par(mfrow=c(2,2)) ## comparisons of two fittings
>>>>>> plot(x, response)
>>>>>> lines(x,fitted(lm.quant),col="blue")
>>>>>> grid()
>>>>>> plot(x,response)
>>>>>> lines(x,fitted(lm.fact),col = "red")
>>>>>> grid()
>>>>>> plot(x, response)
>>>>>> lines(x,fitted(lm.quant.1),lty =2,col="blue")
>>>>>> grid()
>>>>>> plot(x,response)
>>>>>> lines(x,fitted(lm.fact.1),lty =2,col = "red")
>>>>>> grid()
>>>>>> par(mfrow = c(1,1))
>>>>>>
>>>>>> So, is it right that we can assign any numeric values to factors,
>>>>>> for example, c(yes, no) = c(18.8,29.9) or (16.9,38.9) in the
>>>>>> above, before doing lm, glm, aov, even nls?
>>>>>
>>>>> You can give factor levels any name you like, including any
>>>>> sequence of digit characters.
>>>>> Unlike "ordinary" R, where unquoted numbers cannot start variable
>>>>> names, factor functions will coerce numeric vectors to character
>>>>> vectors when assigning level names. But you seem to be conflating
>>>>> factors with numeric vectors that have many ties. Those two
>>>>> entities would have different handling by R's regression functions.
> --
>
> David Winsemius, MD
> Heritage Laboratories
> West Hartford, CT
>
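For readers following the thread, here is a minimal sketch of why the two-level fits agree and what changes with three levels. It reuses the poster's setup. The seed and the three-level objects (x3, f3, q3, y3) with their values are assumptions made only for illustration, and as.numeric(factors) is used explicitly because cor() expects numeric input.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

## Two levels: the poster's setup
x <- 1:20
factors <- gl(2, 10, 20, labels = c("yes", "no"))
factors.quant <- rep(c(18.8, 29.9), c(10, 10))
response <- 0.8 * x + 18 + factors.quant + rnorm(20)

lm.fact  <- lm(response ~ x + factors)
lm.quant <- lm(response ~ x + factors.quant)

# factors.quant equals 18.8 + 11.1 * (indicator of "no"), so both models
# span the same column space and give the same fitted values.
same.fit.2 <- isTRUE(all.equal(fitted(lm.fact), fitted(lm.quant)))

# The dummy coefficient is the numeric slope times the gap between codes.
b.dummy  <- unname(coef(lm.fact)["factorsno"])
b.scaled <- unname(coef(lm.quant)["factors.quant"]) * (29.9 - 18.8)

# Correlation is unchanged by a positive affine recoding of two codes.
r.codes <- cor(response, as.numeric(factors))
r.quant <- cor(response, factors.quant)

## Three levels: hypothetical data where the equivalence breaks down
x3 <- 1:30
f3 <- gl(3, 10, 30, labels = c("a", "b", "c"))
q3 <- rep(c(1, 2, 10), each = 10)                    # arbitrary numeric codes
y3 <- 0.8 * x3 + c(5, 20, 8)[as.integer(f3)] + rnorm(30)

fit.f3 <- lm(y3 ~ x3 + f3)   # one coefficient per non-reference level
fit.q3 <- lm(y3 ~ x3 + q3)   # a single slope for the numeric codes
same.fit.3 <- isTRUE(all.equal(fitted(fit.f3), fitted(fit.q3)))

c(two.levels = same.fit.2, three.levels = same.fit.3)
```

The thread's own output is consistent with the two-level identity: 1.2350 * (29.9 - 18.8) and 0.6231 * (38.9 - 16.9) are both approximately 13.709, the reported factorsno coefficient. With three levels, a numeric coding forces the level effects onto a straight line in the chosen codes, so the fits generally differ from the factor model.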
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.