You shouldn't have to do any of what you are doing. See ?predict.lm and note the "newdata" argument.
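For example, if the factors are made columns of the data frame and the model
is fit from it, predict() constructs the design matrix (including the dummy
coding of the factors) for you -- a minimal sketch, assuming only the data
frame and column names from the code quoted below:

  dat <- transform(data, sex = factor(sex), ed = factor(ed))
  fit <- lm(income ~ sex + age + ed, data = dat)
  residuals(fit)                     # in-sample residuals, directly
  predict(fit, newdata = dat[1, ])   # prediction for any (e.g. held-out) row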
Also, you should spend some time studying a linear model text, as your
question appears to indicate some basic confusion (e.g. about "contrasts")
about how linear models work.

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it." -- Opus (aka Berkeley Breathed in his "Bloom
County" comic strip)

On Sat, Nov 17, 2018 at 1:24 PM Julian Righ Sampedro
<julian.righ.sampe...@gmail.com> wrote:
>
> Dear all,
>
> In a context of regression, I have three regressors, two of which are
> categorical variables (sex and education) and have class 'factor'.
>
> y  = data$income
> x1 = as.factor(data$sex)   # 2 levels
> x2 = data$age              # continuous
> x3 = as.factor(data$ed)    # 8 levels
>
> For example, the first entries of x3 are
>
> head(x3)
> [1] 5 3 5 5 4 2
> Levels: 1 2 3 4 5 6 7 8
>
> When we call the model, the output looks like this:
>
> model1 = lm(y ~ x1 + x2 + x3, data = data)
> summary(model1)
>
> Residuals:
>    Min     1Q Median     3Q    Max
> -31220  -6300   -594   4429 190731
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)
> (Intercept)  1440.66    3809.99   0.378 0.705417
> x1          -4960.88     772.96  -6.418 2.13e-10 ***
> x2            181.45      25.03   7.249 8.41e-13 ***
> x32          2174.95    3453.22   0.630 0.528948
> x33          7497.68    3428.94   2.187 0.029004 *
> x34          8278.97    3576.30   2.315 0.020817 *
> x35         13686.88    3454.93   3.962 7.97e-05 ***
> x36         15902.92    4408.49   3.607 0.000325 ***
> x37         28773.13    3696.77   7.783 1.76e-14 ***
> x38         31455.55    5448.11   5.774 1.03e-08 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>
> Residual standard error: 12060 on 1001 degrees of freedom
> Multiple R-squared:  0.2486, Adjusted R-squared:  0.2418
> F-statistic: 36.79 on 9 and 1001 DF,  p-value: < 2.2e-16
>
> Now suppose I want to compute the residuals. To do so, I first need to
> compute the model's predictions. (I use this in a cross-validation
> context, so this is a partial display of the code.)
>
> yhat1 = model1$coef[1] + model1$coef[2]*x1[i] + model1$coef[3]*x2[i] +
>         model1$coef[4]*x3[i]
>
> But I get the following warnings:
>
> Warning messages:
> 1: In Ops.factor(model1$coef[2], x1[i]) : ‘*’ not meaningful for factors
> 2: In Ops.factor(model1$coef[4], x3[i]) : ‘*’ not meaningful for factors
>
> 1st question: Is there a way to multiply the coefficient by the 'factor'
> without having to transform my 'factor' into a 'numeric' variable?
>
> 2nd question: Since x3 is associated with 7 parameters (one for x32, one
> for x33, ..., one for x38), how do I multiply the 'correct' parameter
> coefficient with my 'factor' x3?
>
> I have been considering an 'if then' solution, but to no avail. I have
> also considered splitting my x3 variable into 8 binary variables, without
> success. What may be the best approach? Thank you for your help.
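To answer the two numbered questions directly: no single coefficient
multiplies a factor. lm() expands each factor into 0/1 dummy columns via
model.matrix(), and each dummy column gets its own coefficient -- that is
what x32, ..., x38 are. A hand-rolled prediction is therefore the
coefficient vector times a row of the design matrix, not a coefficient times
a factor. A minimal sketch, using only the model1 defined above:

  X    <- model.matrix(model1)      # intercept, x1 dummy, x2, 7 dummies for x3
  yhat <- drop(X %*% coef(model1))  # one coefficient per design-matrix column
  all.equal(unname(yhat), unname(fitted(model1)))   # TRUE

Note in particular that model1$coef[4] is not "the" coefficient of x3; it is
the coefficient of the single dummy column x32.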
> Since I understand this may not be specific enough, I add here the
> complete code:
>
> # for n-fold cross validation
> # fit models on leave-one-out samples
> x1 = as.factor(data$sex)
> x2 = data$age
> x3 = as.factor(data$ed)
> yn = data$income
> n  = length(yn)
> e1 = e2 = numeric(n)
>
> for (i in 1:n) {
>   # the ith observation is excluded
>   y   = yn[-i]
>   x_1 = x1[-i]
>   x_2 = x2[-i]
>   x_3 = x3[-i]
>   x_4 = as.factor(cf4)[-i]
>   # fit the first model without the ith observation
>   J1 = lm(y ~ x_1 + x_2 + x_3)
>   yhat1 = J1$coef[1] + J1$coef[2]*x1[i] + J1$coef[3]*x2[i] +
>           J1$coef[4]*x3[i]
>   # construct the ith part of the loss function for model 1
>   e1[i] = yn[i] - yhat1
>   # fit the second model without the ith observation
>   J2 = lm(y ~ x_1 + x_2 + x_3 + x_4)
>   yhat2 = J2$coef[1] + J2$coef[2]*x1[i] + J2$coef[3]*x2[i] +
>           J2$coef[4]*x3[i] + J2$coef[5]*cf4[i]
>   e2[i] = yn[i] - yhat2
> }
> sqrt(c(mean(e1^2), mean(e2^2)))  # RMSE
>
> cf4 is a variable corresponding to groups (obtained using mixtures). What
> we want to demonstrate is that the prediction error after cross-validation
> is lower when we include this latent grouping variable. It works wonders
> when the categorical variables are treated as 'numeric' variables, though
> the OLS estimates are obviously very different.
>
> Thank you in advance for your views on the problem.
>
> Best Regards,
>
> julian
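For completeness, here is how the leave-one-out loop could be written so
that the factors are handled correctly -- a sketch, assuming sex, age, ed
and income are columns of data and that cf4 is a grouping vector of the
same length (all names taken from the quoted code):

  # leave-one-out cross validation, letting predict() do the dummy coding
  dat     <- transform(data, sex = factor(sex), ed = factor(ed))
  dat$cf4 <- factor(cf4)            # the latent grouping variable
  n  <- nrow(dat)
  e1 <- e2 <- numeric(n)

  for (i in 1:n) {
    # refit both models without the ith observation
    J1 <- lm(income ~ sex + age + ed,       data = dat[-i, ])
    J2 <- lm(income ~ sex + age + ed + cf4, data = dat[-i, ])
    # predict() rebuilds the design-matrix row for the held-out observation,
    # so each factor level is matched to its own dummy coefficient
    e1[i] <- dat$income[i] - predict(J1, newdata = dat[i, ])
    e2[i] <- dat$income[i] - predict(J2, newdata = dat[i, ])
  }
  sqrt(c(mean(e1^2), mean(e2^2)))   # RMSE for each model

One caveat: if the ith row is the only observation at some factor level, the
leave-one-out fit has no information about that level, so its prediction for
that row is unreliable (R will warn about a rank-deficient fit).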