Dear R-helpers,

I am working on a project assessing the prevalence and variance (random effects) of linear and nonlinear trends in a data set consisting of short time series (each identified as PID 1 through 5). I am using mgcv (gam) with the bs="re" option (more on why not gamm or gamm4 at the end). I want to know whether the trend is linear or nonlinear in each time series, and whether the trend varies significantly across time series (PID 1 through 5 below), e.g., whether they have significantly different edfs. The analysis predicts a count outcome (DVY) from time (SessIDX) and treatment (TX), using a quasipoisson family to correct the standard errors for overdispersion. The dataset in the examples below (SID5DVID1) contains 5 time series (PID 1 through 5) with 93, 97, 70, 86, and 103 time points, respectively. E.g.,

    JID SID PID DVID SessIDX DVY TX
1     1   5   1    1       1  23  0
2     1   5   1    1       2  20  0
3     1   5   1    1       3  30  0
4     1   5   1    1       4  40  0

…

434   1   5   5    1     100   5  1
435   1   5   5    1     101   3  1
436   1   5   5    1     102  12  1
437   1   5   5    1     103   7  1
>
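
Where the sketches further below use PID directly as a grouping variable (rather than via factor(PID) inside the formula), I assume it has been converted to a factor first, e.g.:

str(SID5DVID1)                           # check column types first
SID5DVID1$PID <- factor(SID5DVID1$PID)   # grouping factor, needed below for bs = "fs" and bs = "re" terms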

My question concerns how to interpret the random effects produced by different specifications of the model. A model with a smoothed trend and no random effect yields

M1 <- gam(DVY  ~ s(SessIDX) + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")

Approximate significance of smooth terms:
             edf Ref.df     F  p-value
s(SessIDX) 1.876  2.403 11.52 4.34e-06 ***
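
One can also inspect the fitted smooth visually; a sketch of the standard plot.gam call (the arguments are just the ones for a shaded confidence band):

plot(M1, shade = TRUE, seWithMean = TRUE)   # fitted s(SessIDX) with a shaded confidence band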

So the overall trend, pooled across all five time series, is close to linear (edf = 1.876) and clearly significant. If I allow the trend to interact with PID (i.e., to differ across time series), still without a random effect, I get

M1a <- gam(DVY  ~ s(SessIDX, by = factor(PID)) + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M1a,all.p=TRUE)

Approximate significance of smooth terms:
                          edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 1.001  1.001 31.827 2.98e-08 ***
s(SessIDX):factor(PID)2 1.000  1.000  2.658 0.103724
s(SessIDX):factor(PID)3 5.100  5.886  8.674 9.42e-09 ***
s(SessIDX):factor(PID)4 1.337  1.602  9.822 0.000366 ***
s(SessIDX):factor(PID)5 3.998  4.924  7.967 4.40e-07 ***
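
One formulation I have come across for testing formally whether these per-series smooths depart from a common curve uses an ordered factor, so that each PID after the first gets a "difference smooth" relative to a reference trend. A sketch (the variable oPID and the model name M1b are my own names, and I have not verified that this is the right test for my purpose):

SID5DVID1$oPID <- ordered(SID5DVID1$PID)        # ordered factor: by= then builds difference smooths
M1b <- gam(DVY ~ s(SessIDX)                     # reference trend
                 + s(SessIDX, by = oPID)        # per-PID deviation from the reference
                 + oPID + factor(TX),           # parametric level effects
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M1b)   # p-values on the by-oPID smooths test departure from the common trend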

In any case, it appears the trend may be quite different in each time series. So now I want to test whether the trend varies significantly across time series using bs="re". One model is

M2 <- gam(DVY  ~ s(SessIDX, bs = "re") + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M2,all.p=TRUE)
gam.vcomp(M2)

Approximate significance of smooth terms:
              edf Ref.df     F  p-value
s(SessIDX) 0.9609      1 24.48 6.53e-07 ***
…
Standard deviations and 0.95 confidence intervals:
              std.dev      lower      upper
s(SessIDX) 0.01470809 0.00347556 0.06224258
scale      3.20933040 3.00277692 3.43009218

Rank: 2/2
>

So the random-effect std.dev is 0.01470809, and it is significant with p-value = 6.53e-07. I had originally thought this meant that the trend varied significantly across the five time series. However, after reading a recent article (Gary J. McKeown and Ian Sneddon, DOI: 10.1037/a0034282), I am not sure I am interpreting this correctly. In their article, this approach (a) changes the shape of the nonlinear trend only slightly compared with not using random effects [i.e., just using s(SessIDX)], but still produces only one trend line for all time series, and (b) widens the confidence bands around that trend line (which of course makes perfect sense). In other words, it does not obviously allow each time series (PID) to have its own edf or trend line. So, parallel with M1a above, I can run:

M5a <- gam(DVY  ~ s(SessIDX, by = factor(PID), bs="re") + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
summary(M5a,all.p=TRUE)
gam.vcomp(M5a)

Approximate significance of smooth terms:
                           edf Ref.df      F  p-value
s(SessIDX):factor(PID)1 0.9239      1  38.76 0.000368 ***
s(SessIDX):factor(PID)2 0.9892      1 107.00  < 2e-16 ***
s(SessIDX):factor(PID)3 0.9867      1  94.85  < 2e-16 ***
s(SessIDX):factor(PID)4 0.9830      1 108.33 1.64e-13 ***
s(SessIDX):factor(PID)5 0.9658      1  73.41 1.32e-07 ***

Standard deviations and 0.95 confidence intervals:

                            std.dev       lower      upper
s(SessIDX):factor(PID)1 0.008853821 0.001960054 0.03999387
s(SessIDX):factor(PID)2 0.065276527 0.016064261 0.26524874
s(SessIDX):factor(PID)3 0.048967099 0.012003711 0.19975296
s(SessIDX):factor(PID)4 0.027830597 0.006777380 0.11428341
s(SessIDX):factor(PID)5 0.012218325 0.002893741 0.05158977
scale                   2.559988984 2.394424050 2.73700208

This seems to produce five random effects, all apparently significant. But it is not clear to me how to interpret them. For example, consider s(SessIDX):factor(PID)1, with std.dev 0.008853821 above. Clearly this is not what I want: it is not a measure of whether the five time series differ significantly from each other in trend. I am guessing this random effect concerns whether time points within PID 1 differ significantly from each other?
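
For comparison, an alternative I have seen described (but have not yet tried on these data) is the "factor smooth" basis bs="fs", which gives each PID its own random smooth curve around a common trend; gam.vcomp() then reports how much those curves vary across series. A sketch, again assuming PID is a factor (the model name M6 is mine):

M6 <- gam(DVY ~ s(SessIDX)                      # common trend
                + s(SessIDX, PID, bs = "fs")    # random smooth deviation for each PID
                + factor(TX),
          data = SID5DVID1,
          family = quasipoisson(link="log"), method="REML")
summary(M6)
gam.vcomp(M6)   # variance components for the between-series deviations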

One might, of course, use gamm or gamm4 to address the random-effects question. I had hoped, however, to run all analyses within the same function (gam) to avoid confounding differences in results with differences in the estimation routines that different functions use.

Running everything in gam, however, runs into another problem: it does not appear that gam can use bs="re" while constraining the trend to be linear. For example:

> M4 <- gam(DVY  ~ s(SessIDX, bs = "re", k = 2, fx=TRUE) +
+                factor(TX),
+            data = SID5DVID1,
+            family = quasipoisson(link="log"), method="REML")
> summary(M4,all.p=TRUE)
Error in b$smooth[[m]]$S[[1]] : subscript out of bounds
> gam.vcomp(M4)
Error in if (x$smooth[[i]]$sp[j] < 0) { : argument is of length zero
>
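
If the goal is just a single fixed linear trend plus random linear deviations for each series, I gather the more usual way to write that in gam is a parametric slope together with bs="re" terms that include the grouping factor. A sketch, again assuming PID is a factor (the model name M4b is mine, and I have not run this on these data):

M4b <- gam(DVY ~ SessIDX                        # fixed linear trend
                 + s(PID, bs = "re")            # random intercept for each time series
                 + s(SessIDX, PID, bs = "re")   # random slope in SessIDX for each time series
                 + factor(TX),
           data = SID5DVID1,
           family = quasipoisson(link="log"), method="REML")
gam.vcomp(M4b)   # std.dev of the random-slope term = between-series variation in the linear trend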

I have read extensively to try to understand all this (e.g., Wood 2013, "A simple test for random effects in regression models") but have not succeeded. Any help or advice would be appreciated very much.

Will Shadish

--
William R. Shadish
Distinguished Professor
Founding Faculty

Mailing Address:
William R. Shadish
University of California
School of Social Sciences, Humanities and Arts
5200 North Lake Rd
Merced CA  95343

Physical/Delivery Address:
University of California Merced
ATTN: William Shadish
School of Social Sciences, Humanities and Arts
Facilities Services Building A
5200 North Lake Rd.
Merced, CA 95343

209-228-4372 voice
209-228-4007 fax (communal fax: be sure to include cover sheet)
wshad...@ucmerced.edu
http://faculty.ucmerced.edu/wshadish/index.htm
http://psychology.ucmerced.edu
