Re: [R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?

Simon Wood Mon, 11 Jun 2012 04:31:08 -0700

Hi Martijn,

The issue in this case is that the estimates for the terms are ratherdifferent (not just the p-values). I think that the difference is downto this change made in version 1.7-13 (from changeLog)...


* "re" smooths are no longer subject to side constraint under nesting
  (since this is almost always un-necessary and undesirable, and often
  unexpected).

... in 1.7-11 "re" terms were checked for identifiability under nestingin exactly the same way as any other smooth. This led to the impositionof "identifiability" constraints in your model, since you have nestedrandom effects. However "re" terms are a special case in this context,and it is usually undesirable to impose such constraints, so fromversion 1.7-13 onwards "re" terms are excluded when identifiabilityconstraints are imposed.


best,
Simon


On 11/06/12 09:47, Martijn Wieling wrote:

Hi Simon,

Thanks for your reply. That is very helpful.

However, in the logistic regression example, there also appears to be
a large difference in the p-values and estimates when comparing
summary(model) # mgcv 1.7-11
summary(model,freq=F) # mgcv 1.7-17

These should be the same, right? But why is this not the case? The
results are shown below, and are also explained in this post:
http://r.789695.n4.nabble.com/mgcv-bam-very-large-standard-error-difference-between-versions-1-7-11-and-1-7-17-bug-tp4632173p4632491.html)

With kind regards,
Martijn

# call in all versions
model = bam(NoStandard3 ~
te(GeoX,GeoY,WordFreqLog.z,by=OldSpeakerContrast,d=c(2,1)) +
OldSpeakerContrast + PopCntLog.z + s(Word,bs="re") +
s(Placename,bs="re") + s(Word,PopAvgIncomeLog.z,bs="re") +
s(Word,PopAvgAgeLog_residIncome.z,bs="re") +
s(Word,PopCntLog.z,bs="re") +
s(Word,YearRec.z,bs="re"),data=lexdst,family="binomial",method="REML")

### mgcv, version 1.7-11
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
#    d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
#    bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
#    bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
#    s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
#                    Estimate Std. Error z value Pr(>|z|)
#(Intercept)         -0.41610    0.14027  -2.966  0.00301 **
#OldSpeakerContrast1  0.44508    0.01934  23.009<  2e-16 ***
#PopCntLog.z         -0.11673    0.02725  -4.284 1.83e-05 ***
#---
#Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
#                                                   edf Ref.df  Chi.sq  p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0  54.05  69.06   223.3<  2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1  59.17  74.83   327.6<  2e-16
#s(Word)                                         166.41 167.84 13756.0<  2e-16
#s(Placename)                                    157.64 188.02   692.7<  2e-16
#s(Word,PopAvgIncomeLog.z)                       137.83 160.05   829.9<  2e-16
#s(Word,PopAvgAgeLog_residIncome.z)              121.13 151.13   435.5<  2e-16
#s(Word,PopCntLog.z)                              97.12 133.97   205.5 7.01e-05
#s(Word,YearRec.z)                               128.98 155.27   585.7<  2e-16
#
#R-sq.(adj) =  0.372   Deviance explained = 32.4%
#REML score =  97450  Scale est. = 1         n = 69259

### mgcv 1.7-17, summary(model,freq=F)
#Family: binomial
#Link function: logit
#
#Formula:
#NoStandard3 ~ te(GeoX, GeoY, WordFreqLog.z, by = OldSpeakerContrast,
#    d = c(2, 1)) + OldSpeakerContrast + PopCntLog.z + s(Word,
#    bs = "re") + s(Placename, bs = "re") + s(Word, PopAvgIncomeLog.z,
#    bs = "re") + s(Word, PopAvgAgeLog_residIncome.z, bs = "re") +
#    s(Word, PopCntLog.z, bs = "re") + s(Word, YearRec.z, bs = "re")
#
#Parametric coefficients:
#                    Estimate Std. Error z value Pr(>|z|)
#(Intercept)         -0.98150    0.52485  -1.870   0.0615 .
#OldSpeakerContrast1  0.65497    0.68628   0.954   0.3399
#PopCntLog.z         -0.11724    0.02726  -4.300 1.71e-05 ***
#---
#Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1
#
#Approximate significance of smooth terms:
#                                                   edf Ref.df  Chi.sq p-value
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast0  53.62  68.48   221.4<2e-16
#te(GeoX,GeoY,WordFreqLog.z):OldSpeakerContrast1  58.64  74.14   324.9<2e-16
#s(Word)                                         166.41 168.00 14351.7<2e-16
#s(Placename)                                    157.79 209.00  1160.0<2e-16
#s(Word,PopAvgIncomeLog.z)                       137.84 170.00  1062.1<2e-16
#s(Word,PopAvgAgeLog_residIncome.z)              121.15 170.00   686.7<2e-16
#s(Word,PopCntLog.z)                              97.10 169.00   434.0<2e-16
#s(Word,YearRec.z)                               128.98 170.00   867.4<2e-16
#
#R-sq.(adj) =  0.372   Deviance explained = 32.4%
#REML score =  97449  Scale est. = 1         n = 69259


Met vriendelijke groet,
Martijn

--
*******************************************
Martijn Wieling
http://www.martijnwieling.nl
wiel...@gmail.com
+31(0)614108622
*******************************************
University of Groningen
http://www.rug.nl/staff/m.b.wieling
*******************************************


On Mon, Jun 11, 2012 at 10:34 AM, Simon Wood<s.w...@bath.ac.uk>  wrote:

Dear Martijn,

You are right that the change in the summary defaults was not a sensible
thing to do, and I'll change it back. However in the mean time you can use
1.7-17 with summary(...,freq=FALSE), when you have random effects in the
model (it really shouldn't make a statistically meaningful difference
otherwise).

I made the change because it leads to better p-value performance when
parametric model component are penalized using gam's `paraPen' argument, but
as you have demonstrated this then uses a fixed effects distribution that is
not really appropriate when there are random effects present. I should
probably just deal with the paraPen case in the paraPen documentation.

Technically the difference here is that with freq=TRUE the covariance matrix
for the model coefficient estimators/predictions is taken to be

  (X'X+S)^{-1}X'X(X'X+S)^{-1}\sigma^2

where X is the model matrix and S the penalty matrix for the model (Gaussian
case, non-Gaussian also has weights). This is based on the fact that the
coefficient estimates/predictions are
  \hat \beta = (X'X+S)^{-1}X'y
where y is the response data, and y_i iid N(0,\sigma^2).

When freq=FALSE the covariance matrix is the one that comes from the
Bayesian/random effects view of the smoothing process, and is

  (X'X+S)^{-1}\sigma^2

The latter is usually what you want when you have true random effects in the
model (rather than smooths which are usually not in the model as true random
effects).

best,
Simon



On 03/06/12 18:45, Martijn Wieling wrote:


Dear useRs,

I've ran some additional analyses (see below), which strongly suggest
the standard errors of the bam (and gam) function are much too low in
mgcv version 1.7-17, at least when including an s(X,bs="re") term.
Until this issue has been clarified, it's perhaps best to use an older
version of mgcv (unfortunately, however, in earlier versions the
p-value calculation of s(X,bs="re") is not correct). All analyses were
conducted in R 2.15.0.

My approach was the following: I created a mixed-effects regression
model with a single random intercept and only linear predictors. In my
view, the results using lmer (lme4) should be comparable to those of
bam and gam (mgcv). This was the case when using an older version of
mgcv (version 1.7-13), but this is not the case anymore in version
1.7-17. In version 1.7-17, the standard errors and p-values are much
lower and very similar to those of a linear model (which does not take
the random-effects structure into account). The R-code and results are
shown below. (The results using gam are not shown, but show the same
pattern.)

Furthermore, note that the differences in standard errors become less
severe (but still noticeable) when less data is involved (e.g., using
only 500 rows as opposed to>100.000). Finally, when not including an
s(X,bs="re") term, but another non-random-effect smooth, the standard
errors do not appear to be structurally lower (only for some
variables, but not by a great deal - see also below).

With kind regards,
Martijn Wieling
University of Groningen

#### lme4 model (most recent version of lme4)
modelLMER<- lmer(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + (1|Key), data=wrddst)
#                        Estimate Std. Error t value
#SpYearBirth.z          -0.012084   0.004577  -2.640
#IsAragon                0.138959   0.010040  13.840
#SpIsMale               -0.003087   0.008290  -0.372
#SpYearBirth.z:IsAragon  0.015429   0.010159   1.519


#### mgcv 1.7-13, default (method = "REML") - almost identical to
modelLMER
modelBAMold<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst)
#                        Estimate Std. Error t value Pr(>|t|)
#SpYearBirth.z          -0.012084   0.004578  -2.640  0.00829 **
#IsAragon                0.138959   0.010042  13.838<    2e-16 ***
#SpIsMale               -0.003087   0.008292  -0.372  0.70968
#SpYearBirth.z:IsAragon  0.015429   0.010160   1.519  0.12886


#### mgcv 1.7-17, method = "REML" - standard errors greatly reduced
# (comparable to standard errors of LM without random intercept)
modelBAMnew<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
#                        Estimate Std. Error t value Pr(>|t|)
#SpYearBirth.z          -0.012084   0.001159 -10.428<    2e-16 ***
#IsAragon                0.138959   0.002551  54.472<    2e-16 ***
#SpIsMale               -0.003087   0.002098  -1.471    0.141
#SpYearBirth.z:IsAragon  0.015429   0.002587   5.965 2.45e-09 ***

#### lm results, standard errors comparable to mgcv 1.7-17
modelLM<- lm(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon + SpIsMale,
data=wrddst)
#                        Estimate Std. Error t value Pr(>|t|)
#(Intercept)            -0.025779   0.001653 -15.595<    2e-16 ***
#SpYearBirth.z          -0.011906   0.001182 -10.070<    2e-16 ***
#IsAragon                0.139323   0.002603  53.531<    2e-16 ***
#SpIsMale               -0.003076   0.002140  -1.437    0.151
#SpYearBirth.z:IsAragon  0.015252   0.002639   5.780 7.49e-09 ***


#### mgcv 1.7-17, default (method = "fREML") - completely different
from previous models
modelBAMfREML<- bam(RefPMIdistMeanLog.c ~ SpYearBirth.z*IsAragon +
SpIsMale + s(Key,bs="re"), data=wrddst); print(testje,cor=F)
#                        Estimate Std. Error t value Pr(>|t|)
#(Intercept)            -0.025391   0.106897  -0.238    0.812
#SpYearBirth.z          -0.012084   0.076300  -0.158    0.874
#IsAragon                0.138959   0.166697   0.834    0.405
#SpIsMale               -0.003087   0.138291  -0.022    0.982
#SpYearBirth.z:IsAragon  0.015429   0.168260   0.092    0.927
#
#Approximate significance of smooth terms:
#          edf Ref.df     F p-value
#s(Key) -38.95    310 15.67<2e-16 ***


#### differences w.r.t. standard smooths
#### mgcv version 1.7-13
m2old<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
## RESULTS
#Family: gaussian
#Link function: identity
#
#Formula:
#RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
#    SpIsMale
#
#Parametric coefficients:
#                        Estimate Std. Error t value Pr(>|t|)
#(Intercept)            -0.001386   0.004982  -0.278   0.7809
#SpYearBirth.z          -0.012950   0.001167 -11.097<    2e-16 ***
#IsAragon                0.020532   0.023608   0.870   0.3845
#SpIsMale               -0.004788   0.002219  -2.158   0.0309 *
#SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Approximate significance of smooth terms:
#               edf Ref.df     F p-value
#s(GeoX,GeoY) 27.11  28.14 126.2<2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#R-sq.(adj) =  0.0555   Deviance explained = 5.58%
#REML score =  39232  Scale est. = 0.11734   n = 112608


#### mgcv version 1.7-17
m2new<- bam(RefPMIdistMeanLog.c ~ s(GeoX,GeoY) +
SpYearBirth.z*IsAragon + SpIsMale, data=wrddst, method="REML")
#Family: gaussian
#Link function: identity
#
#Formula:
#RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + SpYearBirth.z * IsAragon +
#    SpIsMale
#
#Parametric coefficients:
#                        Estimate Std. Error t value Pr(>|t|)
#(Intercept)            -0.001388   0.003938  -0.352   0.7245
#SpYearBirth.z          -0.012950   0.001167 -11.098<    2e-16 ***
#IsAragon                0.020543   0.018055   1.138   0.2552
#SpIsMale               -0.004788   0.002215  -2.161   0.0307 *
#SpYearBirth.z:IsAragon  0.015611   0.002600   6.005 1.92e-09 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#Approximate significance of smooth terms:
#               edf Ref.df     F p-value
#s(GeoX,GeoY) 27.11  28.14 126.2<2e-16 ***
#---
#Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
#R-sq.(adj) =  0.0555   Deviance explained = 5.58%
#REML score =  39232  Scale est. = 0.11734   n = 112608


On Sat, Jun 2, 2012 at 6:25 PM, Martijn Wieling<wiel...@gmail.com>    wrote:


Dear useRs,

I reran an analysis with bam (mgcv, version 1.7-17) originally
conducted using an older version of bam (mgcv, version 1.7-11) and
this resulted in the same estimates, but much lower standard errors
(in some cases 20 times as low) and lower p-values. This obviously
results in a larger set of significant predictors. Is this result
expected given the improvements in the new version? Or this a bug and
are the p-values of bam in mgcv 1.7-17 too low? The summaries of both
versions are shown below to enable a comparison.

In addition, applying the default method="fREML" (mgcv version 1.7-17)
on the same dataset yields only non-significant results, while all
results are highly significant using method="REML". Furthermore, it
also results in large negative (e.g., -8757) edf values linked to
s(X,bs="RE") terms. Is this correct, or is this a bug? The summary of
the model using method="fREML" is also shown below.

I hope someone can shed some light on this.

With kind regards,
Martijn Wieling,
University of Groningen

#################################
### mgcv version 1.7-11
#################################

Family: gaussian
Link function: identity

Formula:
RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
IsSemiwordOrDemonstrative +
    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
    s(Word, bs = "re") + s(Key, bs = "re")

Parametric coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.099757   0.020234  -4.930 8.23e-07 ***
RefVratio.z                0.105705   0.013328   7.931 2.19e-15 ***
IsSemiwordOrDemonstrative  0.289828   0.046413   6.245 4.27e-10 ***
RefSoundCnt.z              0.119981   0.021202   5.659 1.53e-08 ***
SpYearBirth.z             -0.011396   0.002407  -4.734 2.21e-06 ***
IsAragon                   0.055678   0.033137   1.680  0.09291 .
PopCntLog_residGeo.z      -0.006504   0.003279  -1.984  0.04731 *
SpYearBirth.z:IsAragon     0.015871   0.005459   2.907  0.00365 **
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

Approximate significance of smooth terms:
                edf Ref.df      F p-value
s(GeoX,GeoY)  24.01  24.21  31.16<2e-16 ***
s(Word)      352.29 347.00 501.57<2e-16 ***
s(Key)       269.75 289.25  10.76<2e-16 ***
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

R-sq.(adj) =  0.693   Deviance explained = 69.4%
REML score = -22476  Scale est. = 0.038177  n = 112608


#################################
### mgcv version 1.7-17, much lower p-values and standard errors than
version 1.7-11
#################################

Family: gaussian
Link function: identity

Formula:
RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
IsSemiwordOrDemonstrative +
    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
    s(Word, bs = "re") + s(Key, bs = "re")

Parametric coefficients:
                            Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.0997566  0.0014139 -70.552<    2e-16 ***
RefVratio.z                0.1057049  0.0006565 161.010<    2e-16 ***
IsSemiwordOrDemonstrative  0.2898285  0.0020388 142.155<    2e-16 ***
RefSoundCnt.z              0.1199813  0.0009381 127.905<    2e-16 ***
SpYearBirth.z             -0.0113956  0.0006508 -17.509<    2e-16 ***
IsAragon                   0.0556777  0.0057143   9.744<    2e-16 ***
PopCntLog_residGeo.z      -0.0065037  0.0007938  -8.193 2.58e-16 ***
SpYearBirth.z:IsAragon     0.0158712  0.0014829  10.703<    2e-16 ***
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

Approximate significance of smooth terms:
                edf Ref.df       F p-value
s(GeoX,GeoY)  24.01  24.21   31.15<2e-16 ***
s(Word)      352.29 347.00  587.26<2e-16 ***
s(Key)       269.75 313.00 4246.76<2e-16 ***
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

R-sq.(adj) =  0.693   Deviance explained = 69.4%
REML score = -22476  Scale est. = 0.038177  n = 112608


#################################
### mgcv version 1.7-17, default: method="fREML", all p-values
non-significant and negative edf's of s(X,bs="re")
#################################

Family: gaussian
Link function: identity

Formula:
RefPMIdistMeanLog.c ~ s(GeoX, GeoY) + RefVratio.z +
IsSemiwordOrDemonstrative +
    RefSoundCnt.z + SpYearBirth.z * IsAragon + PopCntLog_residGeo.z +
    s(Word, bs = "re") + s(Key, bs = "re")

Parametric coefficients:
                           Estimate Std. Error t value Pr(>|t|)
(Intercept)               -0.099757   1.730235  -0.058    0.954
RefVratio.z                0.105705   1.145329   0.092    0.926
IsSemiwordOrDemonstrative  0.289828   4.167237   0.070    0.945
RefSoundCnt.z              0.119981   1.901158   0.063    0.950
SpYearBirth.z             -0.011396   0.034236  -0.333    0.739
IsAragon                   0.055678   0.298629   0.186    0.852
PopCntLog_residGeo.z      -0.006504   0.041981  -0.155    0.877
SpYearBirth.z:IsAragon     0.015871   0.077142   0.206    0.837

Approximate significance of smooth terms:
               edf Ref.df       F p-value
s(GeoX,GeoY) -1376      1   7.823 0.00516 **
s(Word)      -8298    347 577.910<    2e-16 ***
s(Key)       -1421    316  13.512<    2e-16 ***
---
Signif. codes:  0 â***â 0.001 â**â 0.01 â*â 0.05 â.â 0.1 â â 1

R-sq.(adj) =  0.741   Deviance explained = 69.4%
fREML score = -22476  Scale est. = 0.038177  n = 112608



--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603               http://people.bath.ac.uk/sw283



--
Simon Wood, Mathematical Science, University of Bath BA2 7AY UK
+44 (0)1225 386603               http://people.bath.ac.uk/sw283

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] mgcv (bam) very large standard error difference between versions 1.7-11 and 1.7-17, bug?

Reply via email to