Re: [R] Predict when regressors are passed through a data matrix

Paolo Agnolucci Wed, 05 May 2010 08:49:50 -0700

Hi Dennis,



very thorough reply - I am amazed. I had realised that the problem was
related to the colnames in the data.frame and had understood that putting both
regressand and regressors in the same data.frame was part of the solution. I
had figured out that I could solve it by adjusting the formula,
e.g. y ~ x1 + x2 in the case of my code, which, being a string, can be built with
a for loop over a list of variables whose names are determined at
run-time. I am using R through Python, and everything needs to run without
any human intervention and without knowing in advance which regressors are
being used. The trick you suggest, i.e. lm(y ~ ., data = xx), solves this
problem without having to build the y ~ x1 + x2 + x3 etc. string,
which makes the code neater.
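
For anyone following along, here is a minimal sketch of the two approaches
mentioned above: the y ~ . shortcut, and building the formula from names known
only at run-time. The seed, the regressors vector and the object names m_dot,
m_built, p_dot and p_built are my own placeholders for illustration, and
reformulate() stands in for the hand-built string; none of this is from the
original code.

```r
set.seed(1)  # arbitrary seed, only to make the sketch repeatable

# Hypothetical regressor names; in practice they would be supplied at run-time.
regressors <- c("x1", "x2")

x <- matrix(rnorm(30), ncol = 2, dimnames = list(NULL, regressors))
y <- 1 + 3 * x[, 1] + 2 * x[, 2] + rnorm(15)

# Keep regressand and regressors together in one data frame.
xx <- data.frame(y = y, x)

# Approach 1: "." stands for every column of xx other than y.
m_dot <- lm(y ~ ., data = xx)

# Approach 2: build the formula from the names instead of pasting a string.
m_built <- lm(reformulate(regressors, response = "y"), data = xx)

# newdata must be a data frame whose column names match the regressors.
new_x <- matrix(rnorm(2), ncol = 2, dimnames = list(NULL, regressors))
new_x.d <- data.frame(new_x)

p_dot <- predict(m_dot, newdata = new_x.d)
p_built <- predict(m_built, newdata = new_x.d)
```

Both calls should return a single prediction for the one new row, since the
two models are the same fit specified in two different ways.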

Thank you very much for spending time providing such a detailed answer.

Paolo

On Wed, May 5, 2010 at 3:21 PM, Dennis Murphy <djmu...@gmail.com> wrote:

> Hi:
>
> The problem arises because the names of the explanatory variables in the
> newdata = data frame used in predict() have to match those in the fitted
> model object. Interestingly, using a matrix for the right hand side of the
> model formula in lm() creates problems for predict().
>
> Using your code,
>
> > x <- matrix(rnorm(30), ncol =2)
> > y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
> > m0 <- lm(y ~ x)
> > m0
> ...
> Coefficients:
> (Intercept)           x1           x2
>    0.590281     4.868230    -0.007012
>
> > new_x <- matrix(rnorm(2), ncol =2)
> > new_x.d <- data.frame(new_x)
> > new_x.d
>          X1        X2
> 1 0.1225315 0.8099963
>
> The covariates in the model are named x1 and x2, whereas those in the
> data frame you want to use in predict() are named X1 and X2, creating a
> name mismatch.
>
> The apparent 'solution' is to change the names in new_x.d to lower case,
> but interesting things happen...
> > names(new_x.d) <- c('x1', 'x2')
> > predict(m0, new_x.d)
>          1          2          3          4          5          6          7
>  1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
>          8          9         10         11         12         13         14
> -2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
>         15
>  6.8596084
> Warning message:
> 'newdata' had 1 rows but variable(s) found have 15 rows
> > new_x.d
>          x1        x2
> 1 0.1225315 0.8099963
>
> Even though the names (apparently) match now, predict() returns the
> predicted values from the original
> input *matrix*, and that turns out to matter...
>
> Let's go back to x and put some column names on it, refit the model and try
> predict() again:
>
> > colnames(x) <- c('x1', 'x2')
> > class(x)
> [1] "matrix"
> > m1 <- lm(y ~ x)
> > predict(m1, new_x.d)
> # Same as above...
>
> Although the variable names in the input matrix and new_x.d now match,
> predict()
> still 'misbehaves'.  To see why,
> > m1
> ...
> Coefficients:
> (Intercept)          xx1          xx2
>    0.590281     4.868230    -0.007012
>
> lm() prefixes each column name with the name of the matrix (x), giving
> xx1 and xx2, thus causing another mismatch with the variable names
> passed to predict().
>
> Now, combine x and y into a data frame, refit the model and try predict()
> again:
> > xx <- data.frame(y, x)
> # verify that it's a data frame with the right variable names...
> > str(xx)
> 'data.frame':   15 obs. of  3 variables:
>  $ y : num  0.236 -6.069 2.687 7.323 -10.028 ...
>  $ x1: num  0.12 -1.261 0.611 1.514 -2.069 ...
>  $ x2: num  0.367 1.192 -0.102 0.117 1.66 ...
>
> # Refit the model and run predict() again:
> > m2 <- lm(y ~ ., data = xx)
> > predict(m2, new_x.d)
>        1
> 1.181113
>
> Now it works.
>
> Evidently, inputting a matrix for the right hand side of the model formula
> in lm() creates
> problems for predict(). According to the help page, the first argument of
> predict.lm() is
> an object of class lm, whereas the second argument is a data frame. As it
> turns out, the
> key phrase needed to understand what's going on is the following:
>
> predict.lm produces predicted values, obtained by evaluating the
> regression function in the frame newdata
> (which defaults to model.frame(object)).
>
> The names of the model.frame() objects in the three models are:
> > names(model.frame(m0))    # x is a matrix, no colnames
> [1] "y" "x"
> > names(model.frame(m1))     # x is a matrix with colnames
> [1] "y" "x"
> > names(model.frame(m2))    # x1 and x2 are variables in a data frame
> [1] "y"  "x1" "x2"
>
> Notice that these are the same as the objects given in the respective model
> formulas.
>
> Moreover,
> > head(model.frame(m0), 1)
>           y       x.1       x.2
> 1 0.2355153 0.1203279 0.3674401
> > head(model.frame(m1), 1)
>           y      x.x1      x.x2
> 1 0.2355153 0.1203279 0.3674401
> > head(model.frame(m2), 1)
>           y        x1        x2
> 1 0.2355153 0.1203279 0.3674401
>
> Now, one can see that the names assigned to the covariates by model.frame()
> when x is a
> matrix depend on the column names assigned to the input matrix. Does this
> help?
>
> Let's copy new_x.d to another data frame object and rename the variables
> for
> prediction with m0:
> > new0 <- new_x.d
> > names(new0) <- c('x.1', 'x.2')
> > predict(m0, new0)
>          1          2          3          4          5          6          7
>  1.1734885 -5.5551829  3.5652911  7.9607333 -9.4959770  4.3378850 -3.5098720
>          8          9         10         11         12         13         14
> -2.1571867  3.8502343  5.8451436 -6.7490334  0.2203290 -4.2810391  0.4988267
>         15
>  6.8596084
> Warning message:
> 'newdata' had 1 rows but variable(s) found have 15 rows
> > new0
>         x.1       x.2
> 1 0.1225315 0.8099963
>
> That doesn't help, either. predict() does not recognize x.1 and x.2 as
> variables, because the model frame of m0 holds a single matrix-valued
> variable named x, as seen in names(model.frame(m0)).
>
>  The moral seems to be: to use predict() predictably, make sure that the
> inputs to lm() are
>  in a data frame. One experiences far fewer headaches that way.
>
> A clearer, pithier explanation of why this phenomenon occurs would be
> welcome, too :)
>
> HTH,
> Dennis
>
>
>   On Wed, May 5, 2010 at 3:16 AM, Paolo Agnolucci <
> agnolucp...@googlemail.com> wrote:
>
>>  Hi everyone,
>>
>> this should be pretty basic but I need to ask for help as I am stuck.
>>
>> I am running simple linear regression models in R with k regressors, where
>> k > 1. In order to automate my code I packed all the regressors into a
>> matrix X so that lm(y ~ X) will always produce the results I want
>> regardless of the variables in X. I am new to R but I found this advice
>> somewhere, so I guess it is relatively standard practice. This works very
>> well until I need to forecast using the estimated model.
>>
>> I cannot pass a matrix to predict(). When I pass a data frame I get the
>> fitted values, which leads me to think that R doesn't see the data frame I
>> pass to predict().
>>
>> Thanks in advance,
>>
>> Paolo
>>
>>
>>
>> # REPRODUCIBLE CODE
>> x <- matrix(rnorm(30), ncol =2)
>> y <- 1 + 3*x[, 1] + 2*x[, 1] + rnorm(15)
>> new_x <- matrix(rnorm(2), ncol =2)
>> new_x.d <- data.frame(new_x)
>>
>> # fitted values
>> predict(lm(y ~ x))
>>
>> # same as fitted values
>> predict(lm(y ~ x), new_x.d)
>>
>> # error
>> predict(lm(y ~ x), new_x)
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>