Hi Dennis,
Thank you for the very thorough reply; I am amazed.

I had realised that the problem was related to the column names in the
data frame, and had understood that putting both the regressand and the
regressors in the same data frame was part of the solution. I had also
figured out that I could solve it by adjusting the formula, e.g.
y ~ x1 + x2 in the case of my code. Since the formula can be supplied as a
string, it can be built with a for loop over a list of variables whose
names are determined at run time. I am using R through Python, and
everything needs to run without any human intervention and without
knowing in advance which regressors are being used. The trick you suggest,
i.e. lm(y ~ ., data = xx), solves this problem without having to build the
y ~ x1 + x2 + x3 etc. string, which makes the code neater.

Thank you very much for taking the time to provide such a detailed answer.

Paolo

On Wed, May 5, 2010 at 3:21 PM, Dennis Murphy <djmu...@gmail.com> wrote:

> Hi:
>
> The problem arises because the names of the explanatory variables in the
> newdata = data frame used in predict() have to match those in the fitted
> model object. Interestingly, using a matrix for the right hand side of the
> model formula in lm() creates problems for predict().
>
> Using your code,
>
> > x <- matrix(rnorm(30), ncol =2)
> > y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> > m0 <- lm(y ~ x)
> > m0
> ...
> Coefficients:
> (Intercept)           x1           x2
>    0.590281     4.868230    -0.007012
>
> > new_x <- matrix(rnorm(2), ncol =2)
> > new_x.d <- data.frame(new_x)
> > new_x.d
>          X1        X2
> 1 0.1225315 0.8099963
>
> The covariates in the model are named x1 and x2, whereas those in the
> data frame you want to use in predict() are named X1 and X2, creating a
> name mismatch.
>
> The apparent 'solution' is to change the names in new_x.d to lower case,
> but interesting things happen...
>
> > names(new_x.d) <- c('x1', 'x2')
> > predict(m0, new_x.d)
>          1          2          3          4          5          6          7
>  1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
>          8          9         10         11         12         13         14
> -2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
>         15
>  6.8596084
> Warning message:
> 'newdata' had 1 rows but variable(s) found have 15 rows
> > new_x.d
>          x1        x2
> 1 0.1225315 0.8099963
>
> Even though the names (apparently) match now, predict() returns the
> predicted values from the original input *matrix*, and that turns out to
> matter...
>
> Let's go back to x and put some column names on it, refit the model and try
> predict() again:
>
> > colnames(x) <- c('x1', 'x2')
> > class(x)
> [1] "matrix"
> > m1 <- lm(y ~ x)
> > predict(m1, new_x.d)
> # Same as above...
>
> Although the variable names in the input matrix and new_x.d now match,
> predict() still 'misbehaves'. To see why,
>
> > m1
> ...
> Coefficients:
> (Intercept)          xx1          xx2
>    0.590281     4.868230    -0.007012
>
> lm() tacks a leading x onto the variable names, thus causing another
> mismatch with variable names in predict().
>
> Now, combine x and y into a data frame, refit the model and try predict()
> again:
>
> > xx <- data.frame(y, x)
> # verify that it's a data frame with the right variable names...
> > str(xx)
> 'data.frame':   15 obs. of  3 variables:
>  $ y : num  0.236 -6.069 2.687 7.323 -10.028 ...
>  $ x1: num  0.12 -1.261 0.611 1.514 -2.069 ...
>  $ x2: num  0.367 1.192 -0.102 0.117 1.66 ...
>
> # Refit the model and run predict() again:
> > m2 <- lm(y ~ ., data = xx)
> > predict(m2, new_x.d)
>        1
> 1.181113
>
> Now it works.
>
> Evidently, inputting a matrix for the right hand side of the model formula
> in lm() creates problems for predict(). According to the help page, the
> first argument of predict.lm() is an object of class lm, whereas the
> second argument is a data frame.
> As it turns out, the key phrase needed to understand what's going on is
> the following:
>
> predict.lm produces predicted values, obtained by evaluating the
> regression function in the frame newdata (which defaults to
> model.frame(object)).
>
> The names of the model.frame() objects in the three models are:
>
> > names(model.frame(m0))   # x is a matrix, no colnames
> [1] "y" "x"
> > names(model.frame(m1))   # x is a matrix with colnames
> [1] "y" "x"
> > names(model.frame(m2))   # x1 and x2 are variables in a data frame
> [1] "y"  "x1" "x2"
>
> Notice that these are the same as the objects given in the respective
> model formulas.
>
> Moreover,
>
> > head(model.frame(m0), 1)
>           y       x.1       x.2
> 1 0.2355153 0.1203279 0.3674401
> > head(model.frame(m1), 1)
>           y      x.x1      x.x2
> 1 0.2355153 0.1203279 0.3674401
> > head(model.frame(m2), 1)
>           y        x1        x2
> 1 0.2355153 0.1203279 0.3674401
>
> Now, one can see that the names assigned to the covariates by
> model.frame() when x is a matrix depend on the column names assigned to
> the input matrix. Does this help?
>
> Let's copy new_x.d to another data frame object and rename the variables
> for prediction with m0:
>
> > new0 <- new_x.d
> > names(new0) <- c('x.1', 'x.2')
> > predict(m0, new0)
>          1          2          3          4          5          6          7
>  1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
>          8          9         10         11         12         13         14
> -2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
>         15
>  6.8596084
> Warning message:
> 'newdata' had 1 rows but variable(s) found have 15 rows
> > new0
>         x.1       x.2
> 1 0.1225315 0.8099963
>
> That doesn't help, either. lm() is not recognizing x.1 and x.2 as variable
> names in the model frame of m0, and this is seen in
> names(model.frame(m0)).
>
> The moral seems to be: to use predict() predictably, make sure that the
> inputs to lm() are in a data frame. One experiences far fewer headaches
> that way.
>
> A clearer, pithier explanation of why this phenomenon occurs would be
> welcome, too :)
>
> HTH,
> Dennis
>
>
> On Wed, May 5, 2010 at 3:16 AM, Paolo Agnolucci <agnolucp...@googlemail.com> wrote:
>
>> Hi everyone,
>>
>> this should be pretty basic but I need to ask for help as I got stuck.
>>
>> I am running simple linear regression models in R with k regressors,
>> where k > 1. In order to automate my code I packed all the regressors in
>> a matrix X so that lm(y~X) will always produce the results I want
>> regardless of the variables in X. I am new to R but I found this advice
>> somewhere, so I guess it is relatively standard practice. This works very
>> well until I need to forecast using the estimated model.
>>
>> I cannot pass a matrix to predict - when I pass a data frame I get the
>> fitted values, which leads me to think that R doesn't see the data.frame
>> I pass to predict
>>
>> Thanks in advance,
>>
>> Paolo
>>
>>
>>
>> # REPRODUCIBLE CODE
>> x <- matrix(rnorm(30), ncol =2)
>> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
>> new_x <- matrix(rnorm(2), ncol =2)
>> new_x.d <- data.frame(new_x)
>>
>> # fitted values
>> predict(lm(y ~ x))
>>
>> # same as fitted values
>> predict(lm(y ~ x), new_x.d)
>>
>> # error
>> predict(lm(y ~ x), new_x)
>>
>> [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
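
For reference, here is a minimal sketch of the two run-time approaches discussed at the top of this message: building the formula from a vector of names, and using lm(y ~ ., data = ...). The data, the regressor names (x1, x2, x3) and the object names are made up for illustration and are not taken from the thread. reformulate() is used here in place of pasting a string together by hand, which is a substitution on my part and only one of several ways to do it.

```r
# Illustrative data only: the names and values below are assumptions.
set.seed(1)
regressors <- c("x1", "x2", "x3")   # assumed to be known only at run time

train <- data.frame(matrix(rnorm(45), ncol = 3,
                           dimnames = list(NULL, regressors)))
train$y <- 1 + 3 * train$x1 + 2 * train$x2 + rnorm(15)

# Approach 1: build the formula y ~ x1 + x2 + x3 from the vector of names.
f <- reformulate(regressors, response = "y")
m_formula <- lm(f, data = train)

# Approach 2: let "." stand for every column of the data frame other than y.
m_dot <- lm(y ~ ., data = train)

# newdata must be a data frame whose column names match the regressors.
new_obs <- data.frame(matrix(rnorm(3), ncol = 3,
                             dimnames = list(NULL, regressors)))
p_formula <- predict(m_formula, newdata = new_obs)
p_dot     <- predict(m_dot, newdata = new_obs)

# For contrast: with a matrix on the right hand side, the model frame holds
# a single matrix-valued column named after the matrix object, which is why
# a newdata frame with columns x1, x2, x3 is not matched by predict().
X <- as.matrix(train[regressors])
y <- train$y
m_matrix <- lm(y ~ X)
frame_names <- names(model.frame(m_matrix))
```

Both data-frame fits should give the same single prediction for new_obs, whereas frame_names for the matrix fit contains only "y" and "X".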