> On 28 Mar 2015, at 18:28 , Ben Bolker <bbol...@gmail.com> wrote:
> 
> peter dalgaard <pdalgd <at> gmail.com> writes:
> 
>>> On 28 Mar 2015, at 00:32 , RiGui <raluca.gui <at> business.uzh.ch> wrote:
>>> 
>>> Hello everybody,
>>> 
>>> I have encountered the following problem with lm():
>>> 
>>> When running lm() with a regressor close to zero - of the order e-10, the
>>> value of the estimate is of huge absolute value, of order millions.
>>> 
>>> However, if I write the formula of the OLS estimator in matrix notation,
>>> pseudoinverse(t(X)*X) * t(X) * y, the results are correct, meaning the
>>> estimate has value 0.
> 
>   How do you know this answer is "correct"?
> 
>>> here is the code:
>>> 
>>> y  <- rnorm(n_obs, 10, 2.89)
>>> x1 <- rnorm(n_obs, 0.00000000000001235657, 0.000000000000000045)
>>> x2 <- rnorm(n_obs, 10, 3.21)
>>> X  <- cbind(x1, x2)
>>> 
>>> bFE <- lm(y ~ x1 + x2)
>>> bFE
>>> 
>>> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
>>> bOLS
>>> 
>>> Note: I am applying a deviation-from-the-mean projection to the data, that
>>> is why I have some regressors with such small values.
>>> 
>>> Thank you for any help!
>>> 
>>> Raluca
> 
>   Is there a reason you can't scale your regressors?
> 
>> Example not reproducible:
> 
>   I agree that the OP's question was not reproducible, but it's
> not too hard to make it reproducible.  I bothered to use
> library("sos"); findFn("pseudoinverse") to find pseudoinverse()
> in corpcor:
Well, it shouldn't be my work, nor yours... And I thought it particularly
egregious to treat a function from an unspecified package as gospel.

> It is true that we get estimates with very large magnitudes,
> but their
> 
> set.seed(101)
> n_obs <- 1000
> y  <- rnorm(n_obs, 10, 2.89)
> x1 <- rnorm(n_obs, mean = 1.235657e-14, sd = 4.5e-17)
> x2 <- rnorm(n_obs, 10, 3.21)
> X  <- cbind(x1, x2)
> 
> bFE <- lm(y ~ x1 + x2)
> bFE
> coef(summary(bFE))
> 
>                  Estimate   Std. Error     t value  Pr(>|t|)
> (Intercept)  1.155959e+01 2.312956e+01  0.49977541 0.6173435
> x1          -1.658420e+14 1.872598e+15 -0.08856254 0.9294474
> x2           3.797342e-02 2.813000e-02  1.34992593 0.1773461
> 
> library("corpcor")
> bOLS <- pseudoinverse(t(X) %*% X) %*% t(X) %*% y
> bOLS
> 
>              [,1]
> [1,] 9.807664e-16
> [2,] 8.880273e-01
> 
> And if we scale the predictor:
> 
> bFE2 <- lm(y ~ I(1e14*x1) + x2)
> coef(summary(bFE2))
> 
>                   Estimate Std. Error    t value  Pr(>|t|)
> (Intercept)    11.55958731   23.12956  0.49977541 0.6173435
> I(1e+14 * x1)  -1.65842037   18.72598 -0.08856254 0.9294474
> x2              0.03797342    0.02813  1.34992593 0.1773461
> 
> bOLS stays constant.
> 
> To be honest, I haven't thought about this enough to see
> which answer is actually correct, although I suspect the
> problem is in bOLS, since the numerical methods behind lm
> (unlike the brute-force pseudoinverse method given here)
> have been carefully considered for numerical stability.

In particular, the pseudoinverse() function has a tol= argument which allows
it to zap small singular values:

> pseudoinverse(crossprod(X)) %*% crossprod(X, y)
             [,1]
[1,] 9.807664e-16
[2,] 8.880273e-01
> pseudoinverse(crossprod(X), tol = 1e-40) %*% crossprod(X, y)
              [,1]
[1,]  1.286421e+15
[2,] -5.327384e-01

Also, notice that there is no intercept in the above, so it would be more
reasonable to compare to

> bFE <- lm(y ~ x1 + x2 - 1)
> summary(bFE)

Call:
lm(formula = y ~ x1 + x2 - 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.1712 -1.9439 -0.0321  1.7637  9.4540 

Coefficients:
    Estimate Std. Error t value Pr(>|t|)    
x1 7.700e+14  2.435e+13   31.62   <2e-16 ***
x2 3.766e-02  2.811e-02    1.34    0.181    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.771 on 998 degrees of freedom
Multiple R-squared:  0.9275,   Adjusted R-squared:  0.9273 
F-statistic:  6382 on 2 and 998 DF,  p-value: < 2.2e-16

Or, maybe better, to use cbind(1, X) in the pseudoinverse() method. It
doesn't help, though.

What really surprises me is this:

> X <- cbind(x1, x2)
> crossprod(X)
             x1           x2
x1 1.526698e-25 1.265085e-10
x2 1.265085e-10 1.145462e+05
> solve(crossprod(X))
Error in solve.default(crossprod(X)) : 
  system is computationally singular: reciprocal condition number = 1.13052e-31
> solve(crossprod(X), tol = 1e-40)
              x1            x2
x1  7.722232e+25 -8.528686e+10
x2 -8.528686e+10  1.029237e-04
> pseudoinverse(crossprod(X), tol = 1e-40)
              [,1]          [,2]
[1,]  6.222152e+25 -6.217178e+10
[2,] -6.871948e+10  7.739466e-05

How does pseudoinverse() go so badly wrong? Notice that, apart from the
scaling, this really isn't very ill-conditioned, but a lot of accuracy seems
to be lost in its SVD step.
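As a rough sanity check on that conditioning claim (only a sketch, assuming
the same set.seed(101) data as in Ben's example above), kappa() in base R
gives the 2-norm condition number when called with exact = TRUE:

kappa(crossprod(X), exact = TRUE)       # astronomical, in line with the
                                        # reciprocal condition number ~1e-31
                                        # that solve() complains about
kappa(crossprod(cbind(1e14 * x1, x2)),  # undo the 1e-14 scale on x1 ...
      exact = TRUE)                     # ... and the condition number should
                                        # drop to roughly the order of 1e3

So the bad conditioning is purely an artefact of the units of x1, not a sign
that the two columns carry nearly the same information.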
Notice, however, that with the rescaled predictor both methods agree:

> X <- cbind(1e14*x1, x2)
> solve(crossprod(X))
                              x2
    0.0077222325 -0.0008528686
x2 -0.0008528686  0.0001029237
> pseudoinverse(crossprod(X))
              [,1]          [,2]
[1,]  0.0077222325 -0.0008528686
[2,] -0.0008528686  0.0001029237

-- 
Peter Dalgaard, Professor,
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd....@cbs.dk  Priv: pda...@gmail.com