>>>>> Fox, John <[email protected]>
>>>>> on Wed, 13 Sep 2017 22:45:07 +0000 writes:
> Dear Terry,
> Even the behaviour of lm() and glm() isn't entirely consistent.  In both
> cases, singularity results in NA coefficients by default, and these are
> reported in the model summary and coefficient vector, but not in the
> coefficient covariance matrix:
> ----------------
>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> + data=longley)
>> summary(mod.lm)
> Call:
> lm(formula = Employed ~ GNP + Population + I(GNP + Population),
> data = longley)
> Residuals:
>      Min       1Q   Median       3Q      Max
> -0.80899 -0.33282 -0.02329  0.25895  1.08800
> Coefficients: (1 not defined because of singularities)
>                      Estimate Std. Error t value Pr(>|t|)
> (Intercept)          88.93880   13.78503   6.452 2.16e-05 ***
> GNP                   0.06317    0.01065   5.933 4.96e-05 ***
> Population           -0.40974    0.15214  -2.693   0.0184 *
> I(GNP + Population)        NA         NA      NA       NA
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> Residual standard error: 0.5459 on 13 degrees of freedom
> Multiple R-squared: 0.9791, Adjusted R-squared: 0.9758
> F-statistic: 303.9 on 2 and 13 DF, p-value: 1.221e-11
>> vcov(mod.lm)
>             (Intercept)           GNP Population
> (Intercept) 190.0269691  0.1445617813 -2.0954381
> GNP           0.1445618  0.0001133631 -0.0016054
> Population   -2.0954381 -0.0016053999  0.0231456
>> coef(mod.lm)
> (Intercept)         GNP  Population I(GNP + Population)
> 88.93879831  0.06317244 -0.40974292                  NA
>>
>> mod.glm <- glm(Employed ~ GNP + Population + I(GNP + Population),
> + data=longley)
>> summary(mod.glm)
> Call:
> glm(formula = Employed ~ GNP + Population + I(GNP + Population),
> data = longley)
> Deviance Residuals:
>      Min       1Q   Median       3Q      Max
> -0.80899 -0.33282 -0.02329  0.25895  1.08800
> Coefficients: (1 not defined because of singularities)
>                      Estimate Std. Error t value Pr(>|t|)
> (Intercept)          88.93880   13.78503   6.452 2.16e-05 ***
> GNP                   0.06317    0.01065   5.933 4.96e-05 ***
> Population           -0.40974    0.15214  -2.693   0.0184 *
> I(GNP + Population)        NA         NA      NA       NA
> ---
> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
> (Dispersion parameter for gaussian family taken to be 0.2980278)
> Null deviance: 185.0088 on 15 degrees of freedom
> Residual deviance: 3.8744 on 13 degrees of freedom
> AIC: 30.715
> Number of Fisher Scoring iterations: 2
>> coef(mod.glm)
> (Intercept)         GNP  Population I(GNP + Population)
> 88.93879831  0.06317244 -0.40974292                  NA
>> vcov(mod.glm)
>             (Intercept)           GNP Population
> (Intercept) 190.0269691  0.1445617813 -2.0954381
> GNP           0.1445618  0.0001133631 -0.0016054
> Population   -2.0954381 -0.0016053999  0.0231456
> ----------------
> Moreover, lm() has a singular.ok argument that defaults to TRUE, but
> glm() doesn't have this argument:
> ----------------
>> mod.lm <- lm(Employed ~ GNP + Population + I(GNP + Population),
> + data=longley, singular.ok=FALSE)
> Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
> singular fit encountered
> ----------------
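(Until glm() gets such an argument, the stricter behaviour can be
emulated with a small wrapper; a minimal sketch only, the wrapper name
is made up:)
----------------
glm.ok <- function(...) {
    fit <- glm(...)
    if (anyNA(coef(fit)))                  # any aliased coefficient?
        stop("singular fit encountered")   # mimic lm(*, singular.ok = FALSE)
    fit
}
## glm.ok(Employed ~ GNP + Population + I(GNP + Population), data = longley)
## would then signal an error instead of silently returning NA coefficients.
----------------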
> In my opinion, singularity should at least produce a warning, both in
> calls to lm() and glm(), and in summary() output.  Even better, again
> in my opinion, would be to produce an error by default in this
> situation, but doing so would likely break too much existing code.

Yes, I would not want to change that.  Note that this behaviour comes
from S already, i.e., it is long "ingrained".  I think one argument
there was that there are situations with factor predictors of many
levels, and conceptually their 2- or even 3-way interactions (!), where
it is neat to just fit the model (-> get residuals and fitted values)
and also to see implicitly the "necessary rank" of the prediction
space; or, even more specifically, you see for every factor how many of
its levels are "distinguishable"/useful for prediction, given the data.
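(A tiny constructed illustration of that use case: cross two factors,
leave one cell of their crossing empty, and the NA coefficients show
directly which interaction terms the data cannot distinguish:)
----------------
set.seed(1)
d <- data.frame(f = gl(4, 6), g = gl(3, 2, 24), y = rnorm(24))
d <- d[!(d$f == "4" & d$g == "3"), ]   # empty the (f = 4, g = 3) cell
fit <- lm(y ~ f * g, data = d)
which(is.na(coef(fit)))    # the non-estimable interaction coefficient(s)
sum(summary(fit)$aliased)  # the same count, via summary()
----------------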

> I prefer NA to 0 for the redundant coefficients because it at least
> suggests that the decision about what to exclude is arbitrary, and of
> course simply excluding coefficients isn't the only way to proceed.

I'm less modest and would say *definitely*: NAs are highly preferred in
such a situation.

> Finally, the differences in behaviour between coef() and vcov() and
> between lm() and glm() aren't really sensible.

I really haven't seen any differences between lm() and glm() in the
example above.  Maybe you can point them out for me.

I do quite agree that vcov() should be compatible with coef() [and
summary()] for both the 'lm' and 'glm' methods, i.e., it should get NA
rows and columns there as well.  This would require eliminating them
before using the matrix in e.g. solve(<vcov>, *) etc., but I think it
would be a good idea that the useR must deal with these NAs actively.
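Roughly like this (vcovFull is only a sketch, not a proposed name):
----------------
vcovFull <- function(fit) {
    b  <- coef(fit)
    V  <- matrix(NA_real_, length(b), length(b),
                 dimnames = list(names(b), names(b)))
    ok <- !is.na(b)
    V[ok, ok] <- vcov(fit)   # current vcov() only has the non-NA part
    V
}
## and callers would then drop the NAs themselves, e.g.
##   ok <- !is.na(coef(fit))
##   solve(vcovFull(fit)[ok, ok], coef(fit)[ok])
----------------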
Shall "we" try and see the fallout in CRAN space?
> Maybe there's some reason for all this that escapes me.
(for the first one---"no error"--- I gave a reason)
> Best,
> John
> --------------------------------------
> John Fox, Professor Emeritus
> McMaster University
> Hamilton, Ontario, Canada
> Web: socserv.mcmaster.ca/jfox
>> -----Original Message-----
>> From: R-devel [mailto:[email protected]] On Behalf Of
>> Therneau, Terry M., Ph.D.
>> Sent: Wednesday, September 13, 2017 6:19 PM
>> To: [email protected]
>> Subject: [Rd] vcov and survival
>>
>> I have just noticed a difference in behavior between coxph and lm/glm:
>> if one or more of the coefficients from the fit is NA, then lm and glm
>> omit that row/column from the variance matrix, while coxph retains it
>> but sets the values to zero.
>>
>> Is this something that should be "fixed", i.e., made to agree? I
>> suspect that doing so will break other packages, but then NA coefs are
>> rather rare so perhaps not.
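(For completeness, the coxph behaviour can be seen with a contrived
model; this assumes the survival package and its lung data, and the
zero filling is as described above:)
----------------
library(survival)
## I(2 * age) is perfectly collinear with age, so one coefficient is NA
cfit <- coxph(Surv(time, status) ~ age + I(2 * age), data = lung)
coef(cfit)   # one of the two entries is NA, just as for lm()/glm()
vcov(cfit)   # 2 x 2: the aliased term is kept (zero row/column, as
             # described above), whereas lm()/glm() drop it entirely
----------------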
>>
>> Terry Therneau
>>
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel