Re: [R] Problem with lm.resid() when weights are provided

Fox, John Fri, 14 Sep 2018 14:46:56 -0700

Dear Hamed,

When you post a question to r-help, generally you should cc subsequent messages 
there as well, as I've done to this response.


The algorithm that lm() uses is much more numerically stable than inverting the 
weighted sum-of-squares-and-product matrix. If you want to see how the 
computations are done, look at lm.wfit(), in which the residuals and fits are 
computed as 

    z$residuals <- z$residuals/wts
    z$fitted.values <- y - z$residuals

Zero weights are handled specially, and your tiny weights are thus the source 
of the problem. When you divide by a number less than the machine 
double-epsilon, you can't expect numerically stable results. I suppose that 
lm.wfit() could check for 0 weights to a tolerance rather than exactly.

John

> -----Original Message-----
> From: Hamed Ha [mailto:hamedhas...@gmail.com]
> Sent: Friday, September 14, 2018 5:34 PM
> To: Fox, John <j...@mcmaster.ca>
> Subject: Re: [R] Problem with lm.resid() when weights are provided
> 
> Hi John,
> 
> Thank you for your reply.
> 
> I agree that the small weights are the potential source of the instability in 
> the
> result. I also suspected that there are some failure/bugs in the actual
> algorithm that R uses for fitting the model. I remember that at some points I
> checked the theoretical estimation of the parameters, solve(t(x) %*% w %*%
> x) %*% t(x) %*% w %*% y, (besides the point that I had to set tol parameter in
> solve() to a super small value) and realised  that lm() and the theoretical
> results match together. That is the parameter estimation is right in R.
> Moreover, I checked the predictions, predict(lm.fit), and it was right. Then 
> the
> only source of error remained was resid() function. I further checked this
> function and it is nothing more than calling a sub-element from and lm() fit.
> Putting all together, I think that there is something wrong/bug/miss-
> configuration in the lm() algorithm and I highly recommend the R core team to
> fix that.
> 
> Please feel free to contact me for more details if required.
> 
> Warm regards,
> Hamed.
> 
> 
> 
> 
> 
> 
> 
> 
> 
> On Fri, 14 Sep 2018 at 13:35, Fox, John <j...@mcmaster.ca
> <mailto:j...@mcmaster.ca> > wrote:
> 
> 
>       Dear Hamed,
> 
>       I don't think that anyone has picked up on this problem.
> 
>       What's peculiar about your weights is that several are 0 within
> rounding error but not exactly 0:
> 
>       > head(df)
>                  y          x       weight
>       1  1.5115614  0.5520924 2.117337e-34
>       2 -0.6365313 -0.1259932 2.117337e-34
>       3  0.3778278  0.4209538 4.934135e-31
>       4  3.0379232  1.4031545 2.679495e-24
>       5  1.5364652  0.4607686 2.679495e-24
>       6 -2.3772787 -0.7396358 6.244160e-21
> 
>       I can reproduce the results that you report:
> 
>       > (mod.1 <- lm(y ~ x, data=df))
> 
>       Call:
>       lm(formula = y ~ x, data = df)
> 
>       Coefficients:
>       (Intercept)            x
>          -0.04173      2.03790
> 
>       > max(resid(mod.1))
>       [1] 1.14046
>       > (mod.2 <- lm(y ~ x, data=df, weights=weight))
> 
>       Call:
>       lm(formula = y ~ x, data = df, weights = weight)
> 
>       Coefficients:
>       (Intercept)            x
>          -0.05786      1.96087
> 
>       > max(resid(mod.2))
>       [1] 36.84939
> 
>       But the problem disappears when the tiny nonzero weight are set to 0:
> 
>       > df2 <- df
>       > df2$weight <- zapsmall(df2$weight)
>       > head(df2)
>                  y          x weight
>       1  1.5115614  0.5520924      0
>       2 -0.6365313 -0.1259932      0
>       3  0.3778278  0.4209538      0
>       4  3.0379232  1.4031545      0
>       5  1.5364652  0.4607686      0
>       6 -2.3772787 -0.7396358      0
>       > (mod.3 <- update(mod.2, data=df2))
> 
>       Call:
>       lm(formula = y ~ x, data = df2, weights = weight)
> 
>       Coefficients:
>       (Intercept)            x
>          -0.05786      1.96087
> 
>       > max(resid(mod.3))
>       [1] 1.146663
> 
>       I don't know exactly why this happens, but suspect numerical
> instability produced by the near-zero weights, which are smaller than the
> machine double-epsilon
> 
>       > .Machine$double.neg.eps
>       [1] 1.110223e-16
> 
>       The problem also disappears, e.g., if the tiny weight are set to 1e-15
> rather than 0.
> 
>       I hope this helps,
>        John
> 
>       -----------------------------------------------------------------
>       John Fox
>       Professor Emeritus
>       McMaster University
>       Hamilton, Ontario, Canada
>       Web: https://socialsciences.mcmaster.ca/jfox/
> 
> 
> 
>       > -----Original Message-----
>       > From: R-help [mailto:r-help-boun...@r-project.org <mailto:r-help-
> boun...@r-project.org> ] On Behalf Of Hamed Ha
>       > Sent: Tuesday, September 11, 2018 8:39 AM
>       > To: r-help@r-project.org <mailto:r-help@r-project.org>
>       > Subject: [R] Problem with lm.resid() when weights are provided
>       >
>       > Dear R Help Team.
>       >
>       > I get some weird results when I use the lm function with weight. The
> issue can
>       > be reproduced by the example below:
>       >
>       >
>       > The input data is (weights are intentionally designed to reflect some
>       > structures in the data)
>       >
>       >
>       > > df
>       > y x weight
>       >  1.51156139  0.55209240 2.117337e-34
>       > -0.63653132 -0.12599316 2.117337e-34
>       >  0.37782776  0.42095384 4.934135e-31
>       >  3.03792318  1.40315446 2.679495e-24
>       >  1.53646523  0.46076858 2.679495e-24
>       > -2.37727874 -0.73963576 6.244160e-21
>       >  0.37183065  0.20407468 1.455107e-17
>       > -1.53917553 -0.95519361 1.455107e-17
>       >  1.10926675  0.03897129 3.390908e-14
>       > -0.37786333 -0.17523593 3.390908e-14
>       >  2.43973603  0.97970095 7.902000e-11
>       > -0.35432394 -0.03742559 7.902000e-11
>       >  2.19296613  1.00355263 4.289362e-04
>       >  0.49845532  0.34816207 4.289362e-04
>       >  1.25005260  0.76306225 5.000000e-01
>       >  0.84360691  0.45152356 5.000000e-01
>       >  0.29565993  0.53880068 5.000000e-01
>       > -0.54081334 -0.28104525 5.000000e-01
>       >  0.83612836 -0.12885659 9.995711e-01
>       > -1.42526769 -0.87107631 9.999998e-01
>       >  0.10204789 -0.11649899 1.000000e+00
>       >  1.14292898  0.37249631 1.000000e+00
>       > -3.02942081 -1.28966997 1.000000e+00
>       > -1.37549764 -0.74676145 1.000000e+00
>       > -2.00118016 -0.55182759 1.000000e+00
>       > -4.24441674 -1.94603608 1.000000e+00
>       >  1.17168144  1.00868008 1.000000e+00
>       >  2.64007761  1.26333069 1.000000e+00
>       >  1.98550114  1.18509599 1.000000e+00
>       > -0.58941683 -0.61972416 9.999998e-01
>       > -4.57559611 -2.30914920 9.995711e-01
>       > -0.82610544 -0.39347576 9.995711e-01
>       > -0.02768220  0.20076910 9.995711e-01
>       >  0.78186399  0.25690215 9.995711e-01
>       > -0.88314153 -0.20200148 5.000000e-01
>       > -4.17076452 -2.03547588 5.000000e-01
>       >  0.93373070  0.54190626 4.289362e-04
>       > -0.08517734  0.17692491 4.289362e-04
>       > -4.47546619 -2.14876688 4.289362e-04
>       > -1.65509103 -0.76898087 4.289362e-04
>       > -0.39403030 -0.12689705 4.289362e-04
>       >  0.01203300 -0.18689898 1.841442e-07
>       > -4.82762639 -2.31391121 1.841442e-07
>       > -0.72658380 -0.39751171 3.397282e-14
>       > -2.35886866 -1.01082109 0.000000e+00
>       > -2.03762707 -0.96439902 0.000000e+00
>       >  0.90115123  0.60172286 0.000000e+00
>       >  1.55999194  0.83433953 0.000000e+00
>       >  3.07994058  1.30942776 0.000000e+00
>       >  1.78871462  1.10605530 0.000000e+00
>       >
>       >
>       >
>       > Running simple linear model returns:
>       >
>       > > lm(y~x,data=df)
>       >
>       > Call:
>       > lm(formula = y ~ x, data = df)
>       >
>       > Coefficients:
>       > (Intercept)            x
>       >    -0.04173      2.03790
>       >
>       > and
>       > > max(resid(lm(y~x,data=df)))
>       > [1] 1.14046
>       >
>       >
>       > *HOWEVER if I use the weighted model then:*
>       >
>       > lm(formula = y ~ x, data = df, weights = df$weights)
>       >
>       > Coefficients:
>       > (Intercept)            x
>       >    -0.05786      1.96087
>       >
>       > and
>       > > max(resid(lm(y~x,data=df,weights=df$weights)))
>       > [1] 60.91888
>       >
>       >
>       > as you see, the estimation of the coefficients are nearly the same
> but the
>       > resid() function returns a giant residual (I have some cases where
> the value is
>       > much much higher). Further, if I calculate the residuals by simply
>       > predict(lm(y~x,data=df,weights=df$weights))-df$y then I get the true
> value for
>       > the residuals.
>       >
>       >
>       > Thanks.
>       >
>       > Please do not hesitate to contact me for more details.
>       > Regards,
>       > Hamed.
>       >
>       >       [[alternative HTML version deleted]]
>       >
>       > ______________________________________________
>       > R-help@r-project.org <mailto:R-help@r-project.org>  mailing list --
> To UNSUBSCRIBE and more, see
>       > https://stat.ethz.ch/mailman/listinfo/r-help
>       > PLEASE do read the posting guide http://www.R-project.org/posting-
>       > guide.html
>       > and provide commented, minimal, self-contained, reproducible
> code.
> 

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] Problem with lm.resid() when weights are provided

Reply via email to