Dear R-helpers,

I am working on a project assessing the prevalence and variance (random effects) of linear and nonlinear trends in a data set consisting of short time series (each identified as PID 1 through 5). I am using mgcv (gam) with the bs="re" option (more on why not gamm or gamm4 at the end). I want to know whether the trend is linear or nonlinear in each time series, and whether the trend varies significantly across time series (PID 1 through 5 below), e.g., whether they have significantly different edfs. The analysis predicts a count outcome (DVY) from time (SessIDX) and treatment (TX), using a quasipoisson family to correct the standard errors for overdispersion. The dataset in the examples below (SID5DVID1) contains 5 time series (PID 1 through 5) with 93, 97, 70, 86, and 103 time points, respectively. E.g.,

    JID SID PID DVID SessIDX DVY TX
1     1   5   1    1       1  23  0
2     1   5   1    1       2  20  0
3     1   5   1    1       3  30  0
4     1   5   1    1       4  40  0

…

434   1   5   5    1     100   5  1
435   1   5   5    1     101   3  1
436   1   5   5    1     102  12  1
437   1   5   5    1     103   7  1
>
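
Where the sketches further below use PID directly as a grouping variable (rather than via factor(PID) inside the formula), I assume it has been converted to a factor first, e.g.:

str(SID5DVID1)                           # check column types first
SID5DVID1$PID <- factor(SID5DVID1$PID)   # grouping factor, needed below for bs = "fs" and bs = "re" terms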

My question concerns how to interpret the random effects produced by different specifications of the model. A model with a smoothed trend and no random effect yields

M1 <- gam(DVY  ~ s(SessIDX) + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")

Approximate significance of smooth terms:
             edf Ref.df     F  p-value
s(SessIDX) 1.876  2.403 11.52 4.34e-06 ***
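
One can also inspect the fitted smooth visually; a sketch of the standard plot.gam call (the arguments are just the ones for a shaded confidence band):

plot(M1, shade = TRUE, seWithMean = TRUE)   # fitted s(SessIDX) with a shaded confidence band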

So the overall trend, pooled across all five time series, is close to linear (edf = 1.876) and clearly significant. If I allow the trend to interact with PID (i.e., to differ across time series), still without a random effect, I get

M1a <- gam(DVY  ~ s(SessIDX, by = factor(PID)) + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M1a,all.p=TRUE)

Approximate significance of smooth terms:
                          edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 1.001  1.001 31.827 2.98e-08 ***
s(SessIDX):factor(PID)2 1.000  1.000  2.658 0.103724
s(SessIDX):factor(PID)3 5.100  5.886  8.674 9.42e-09 ***
s(SessIDX):factor(PID)4 1.337  1.602  9.822 0.000366 ***
s(SessIDX):factor(PID)5 3.998  4.924  7.967 4.40e-07 ***
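
One formulation I have come across for testing formally whether these per-series smooths depart from a common curve uses an ordered factor, so that each PID after the first gets a "difference smooth" relative to a reference trend. A sketch (the variable oPID and the model name M1b are my own names, and I have not verified that this is the right test for my purpose):

SID5DVID1$oPID <- ordered(SID5DVID1$PID)        # ordered factor: by= then builds difference smooths
M1b <- gam(DVY ~ s(SessIDX)                     # reference trend
                 + s(SessIDX, by = oPID)        # per-PID deviation from the reference
                 + oPID + factor(TX),           # parametric level effects
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M1b)   # p-values on the by-oPID smooths test departure from the common trend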

In any case, it appears the trend may be quite different in each time series. So now I want to test whether the trend varies significantly across time series using bs="re". One model is

M2 <- gam(DVY  ~ s(SessIDX, bs = "re") + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M2,all.p=TRUE)
gam.vcomp(M2)

Approximate significance of smooth terms:
              edf Ref.df     F  p-value
s(SessIDX) 0.9609      1 24.48 6.53e-07 ***
…
Standard deviations and 0.95 confidence intervals:
              std.dev      lower      upper
s(SessIDX) 0.01470809 0.00347556 0.06224258
scale      3.20933040 3.00277692 3.43009218

Rank: 2/2
>

So the random-effect std.dev is 0.01470809, and it is significant with p-value = 6.53e-07. I had originally thought this meant that the trend varied significantly across the five time series. However, after reading a recent article (Gary J. McKeown and Ian Sneddon, DOI: 10.1037/a0034282), I am not sure I am interpreting this correctly. In their article, this approach (a) changes the shape of the nonlinear trend only slightly compared with not using random effects [i.e., just using s(SessIDX)], but still produces only one trend line for all time series, and (b) widens the confidence bands around that trend line (which of course makes perfect sense). In other words, it does not obviously allow each time series (PID) to have its own edf or trend line. So, parallel with M1a above, I can run:

M5a <- gam(DVY  ~ s(SessIDX, by = factor(PID), bs="re") + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M5a,all.p=TRUE)
gam.vcomp(M5a)

Approximate significance of smooth terms:
                           edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 0.9239      1  38.76 0.000368 ***
s(SessIDX):factor(PID)2 0.9892      1 107.00  < 2e-16 ***
s(SessIDX):factor(PID)3 0.9867      1  94.85  < 2e-16 ***
s(SessIDX):factor(PID)4 0.9830      1 108.33 1.64e-13 ***
s(SessIDX):factor(PID)5 0.9658      1  73.41 1.32e-07 ***

Standard deviations and 0.95 confidence intervals:

                            std.dev       lower      upper
s(SessIDX):factor(PID)1 0.008853821 0.001960054 0.03999387
s(SessIDX):factor(PID)2 0.065276527 0.016064261 0.26524874
s(SessIDX):factor(PID)3 0.048967099 0.012003711 0.19975296
s(SessIDX):factor(PID)4 0.027830597 0.006777380 0.11428341
s(SessIDX):factor(PID)5 0.012218325 0.002893741 0.05158977
scale                   2.559988984 2.394424050 2.73700208

This seems to produce five random effects, all apparently significant. But it is not clear to me how to interpret them. For example, consider s(SessIDX):factor(PID)1, with std.dev 0.008853821 above. Clearly this is not what I want: it is not a measure of whether the five time series differ significantly from each other in trend. I am guessing this random effect concerns whether time points within PID 1 differ significantly from each other?
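
For comparison, an alternative I have seen described (but have not yet tried on these data) is the "factor smooth" basis bs="fs", which gives each PID its own random smooth curve around a common trend; gam.vcomp() then reports how much those curves vary across series. A sketch, again assuming PID is a factor (the model name M6 is mine):

M6 <- gam(DVY ~ s(SessIDX)                      # common trend
                + s(SessIDX, PID, bs = "fs")    # random smooth deviation for each PID
                + factor(TX),
          data = SID5DVID1,
          family = quasipoisson(link="log"), method="REML")
summary(M6)
gam.vcomp(M6)   # variance components for the between-series deviations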

One might, of course, use gamm or gamm4 to address the random-effects question. I had hoped, however, to run all analyses within the same function (gam) to avoid confounding differences in results with differences in the estimation routines that different functions use.

Running everything in gam, however, runs into another problem: it does not appear that gam can use bs="re" while constraining the trend to be linear. For example:

> M4 <- gam(DVY  ~ s(SessIDX, bs = "re", k = 2, fx=TRUE) +
+                factor(TX),
+            data = SID5DVID1,
+            family = quasipoisson(link="log"), method="REML")
> summary(M4,all.p=TRUE)
Error in b$smooth[[m]]$S[[1]] : subscript out of bounds
> gam.vcomp(M4)
Error in if (x$smooth[[i]]$sp[j] < 0) { : argument is of length zero
>
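
If the goal is just a single fixed linear trend plus random linear deviations for each series, I gather the more usual way to write that in gam is a parametric slope together with bs="re" terms that include the grouping factor. A sketch, again assuming PID is a factor (the model name M4b is mine, and I have not run this on these data):

M4b <- gam(DVY ~ SessIDX                        # fixed linear trend
                 + s(PID, bs = "re")            # random intercept for each time series
                 + s(SessIDX, PID, bs = "re")   # random slope in SessIDX for each time series
                 + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
gam.vcomp(M4b)   # std.dev of the random-slope term = between-series variation in the linear trend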

I have read extensively to try to understand all this (e.g., Wood 2013, "A simple test for random effects in regression models") but have not succeeded. Any help or advice would be appreciated very much.

Will Shadish

--
William R. Shadish
Distinguished Professor
Founding Faculty

Mailing Address:
William R. Shadish
University of California
School of Social Sciences, Humanities and Arts
5200 North Lake Rd
Merced CA  95343

Physical/Delivery Address:
University of California Merced
ATTN: William Shadish
School of Social Sciences, Humanities and Arts
Facilities Services Building A
5200 North Lake Rd.
Merced, CA 95343

209-228-4372 voice
209-228-4007 fax (communal fax: be sure to include cover sheet)
wshad...@ucmerced.edu
http://faculty.ucmerced.edu/wshadish/index.htm
http://psychology.ucmerced.edu
