Juliet, here are the diagnostic plots for you.
Just to recall, the first model was this:

fit<-gam(target ~s(mgs)+s(gsd)+s(mud)+s(ssCmax),family=quasi(link=log),data=wspe1,method="REML",select=F)
> summary(fit)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   -4.724      7.462  -0.633    0.527

Approximate significance of smooth terms:
            edf Ref.df      F p-value
s(mgs)    3.118  3.492  0.099   0.974
s(gsd)    6.377  7.044 15.596  <2e-16 ***
s(mud)    8.837  8.971 18.832  <2e-16 ***
s(ssCmax) 3.886  4.051  2.342   0.052 .
---
R-sq.(adj) = 0.403   Deviance explained = 40.6%
REML score = 33186   Scale est. = 8.7812e+05   n = 4511

(I slightly shortened the output.)

Also of interest is the model error as root mean squared error (RMSE):

> sqrt(mean(residuals.gam(fit,type="response")^2))
[1] 934.6647

Here are the diagnostic plots:
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-1.png>
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-2.png>

Here is Simon's comment on this particular model, from Apr 18, 2013; 5:25pm (see above):

"The p-value computations are based on the approximation that things are approximately normal on the linear predictor scale, but actually they are no where close to normal in this case, which is why the p-values look inconsistent. The reason that the approximate normality assumption doesn't hold is that the model is quite a poor fit. If you take a look at gam.check(fit) you'll see that the constant variance assumption of quasi(link=log) is violated quite badly, and the residual distribution is really quite odd (plot residuals against fitted as well). Also see plot(fit,pages=1,scale=0) - it shows ballooning confidence intervals and smooth estimates that are so low in places that they might as well be minus infinity (given log link) - clearly something is wrong with this model!"

Following Simon's advice (quote):

"try Tweedie(p=1.5,link=log) as the family. Also the predictor variables are very skewed which is giving leverage problems, so I would transform them to give less skew. e.g.
Something like"

> fit<-gam(target~s(log(mgs))+s(I(gsd^.5))+s(I(mud^.25))+s(log(ssCmax)),
+ family=Tweedie(p=1.6,link=log),data=wspe1,method="REML")
> summary(fit)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  4.02654    0.05231   76.97   <2e-16 ***

Approximate significance of smooth terms:
                 edf Ref.df     F p-value
s(log(mgs))    6.067  7.292 12.58  <2e-16 ***
s(I(gsd^0.5))  4.009  5.138 18.25  <2e-16 ***
s(I(mud^0.25)) 7.210  8.240 58.54  <2e-16 ***
s(log(ssCmax)) 8.407  8.764 74.87  <2e-16 ***

R-sq.(adj) = 0.303   Deviance explained = 51%
REML score = 14355   Scale est. = 27.702   n = 4511

(I slightly shortened the output.)

RMSE did not improve:

> sqrt(mean(residuals.gam(fit,type="response")^2))
[1] 1009.268

The diagnostic plots for this model are here:
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-3.png>
<http://r.789695.n4.nabble.com/file/n4665370/screen-capture-4.png>

These look much better. The QQ-plot is closer to the identity line, and the residuals are more evenly spread and much smaller. Still, the correlation of response and fitted values seems pretty low.

Hope this helps,
Jan
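P.S. In case it is useful, here is a minimal sketch of how the RMSE and the response/fitted correlation mentioned above could be computed side by side on the response scale. The helper name response_scale_summary and the simulated toy data are my own assumptions for illustration; they stand in for wspe1, which is not reproduced here, and the toy model is a plain glm rather than one of the gam fits above.

```r
# Hypothetical helper (not from the thread): fit quality on the response scale.
response_scale_summary <- function(observed, fitted_values) {
  # observed - fitted is what residuals(fit, type = "response") returns
  res <- observed - fitted_values
  c(rmse = sqrt(mean(res^2)),
    cor  = cor(observed, fitted_values))
}

# Simulated stand-in data (an assumption; not the wspe1 data set).
set.seed(1)
x <- runif(200)
y <- rpois(200, lambda = exp(1 + 2 * x))
toy_fit <- glm(y ~ x, family = quasipoisson(link = "log"))

toy_summary <- response_scale_summary(y, fitted(toy_fit))
print(toy_summary)

# For a gam fit like the ones above, the analogous call would be:
# response_scale_summary(fit$y, fitted(fit))
```

Using fit$y rather than the column in the data frame keeps the observed and fitted vectors the same length if rows were dropped for missing values.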