Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:

> On 28/02/2013 17:22, Ben Bolker wrote:
> > Xochitl CORMON <Xochitl.Cormon <at> ifremer.fr> writes:
> >> I am encountering some issues with my data and need some help. I am
> >> trying to run a glm analysis with a presence/absence variable as the
> >> response variable and several explanatory variables (time, location,
> >> presence/absence data, abundance data). [snip]
> >>
> >> * logistf : analysis does not run
> >> When running logistf() I get an error message saying:
> >> # error in chol.default(x) :
> >> #   leading minor 39 is not positive definite
> >> I looked into the logistf package manual, on the Internet, and in the
> >> theoretical and technical paper of Heinze and Ploner, and cannot find
> >> where this function is used and whether the error can be fixed by
> >> some settings.
> >
> > chol.default is a function for Cholesky decomposition, which is going
> > to be embedded fairly deeply in the code ...
>
> If I understand correctly, I should just not use this package, as this
> error is not easily fixable?

Yes.

> >> * brglm : analysis runs
> >> However I get a warning message saying:
> >> # In fit.proc(x = X, y = Y, weights = weights, start = start,
> >> #   etastart = etastart, :
> >> # Iteration limit reached
> >> Like before, I cannot find where and why this function is used while
> >> running the package and whether it can be fixed by adjusting some
> >> settings.
> >>
> >> In a more general way, I was wondering what the fundamental
> >> differences between these packages are.
> >
> > You might also take a crack with bayesglm() in the arm package, which
> > should (?) be able to overcome the separation problem by specifying a
> > not-completely-uninformative prior.
>
> Thank you for the tip, I will have a look into this package and its doc
> tomorrow. Do you have any idea of what this fit.proc function is?

It is again deep inside brglm. You can use debug() to try to follow the
process, but it will probably not help much. **However** you should
definitely see ?brglm and especially ?brglm.control ... adding

  control.brglm = brglm.control(br.maxit = 1000)

to your function call might help (the default is 100).

> Here is an extract of my table and the different formulas I run:
>
> >>> head(CPUE_table)
> >>   Year Quarter Subarea Latitude Longitude Presence.S CPUE.S Presence.H CPUE.H Presence.NP CPUE.NP
> >> 1 2000       1    31F1    51.25       1.5          0      0          0      0           0       0
> >>   Presence.BW CPUE.BW Presence.C CPUE.C Presence.P CPUE.P Presence.W   CPUE.W
> >> 1           0       0          1 76.002          0      0          1 3358.667
> >
> > [snip]
> >
> >> logistf_binomPres <- logistf(Presence.S ~ (Presence.BW + Presence.W +
> >>     Presence.C + Presence.NP + Presence.P + Presence.H + CPUE.BW +
> >>     CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
> >>     Latitude + Longitude)^2, data = CPUE_table)
> >>
> >> Brglm_binomPres <- brglm(Presence.S ~ (Presence.BW + Presence.W +
> >>     Presence.C + Presence.NP + Presence.P + Presence.H + CPUE.BW +
> >>     CPUE.H + CPUE.P + CPUE.NP + CPUE.W + CPUE.C + Year + Quarter +
> >>     Latitude + Longitude)^2, family = binomial, data = CPUE_table)
> >
> > It's not much to go on, but:
> > * are you overfitting your data? That is, do you have at least 20
> >   times as many 1's or 0's (whichever is rarer) as the number of
> >   parameters you are trying to estimate?
>
> I have 16 explanatory variables, and with interactions we go to 136
> parameters.
>
> > length(which((CPUE_table)[,] == 0))
> [1] 33466
> > length(which((CPUE_table)[,] == 1))
> [1] 17552
>
> I assume the overfitting is good, isn't it?

No, overfitting is *bad*. But you do have a large data set, so you might
get away with fitting a model that's this complex.
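(Side note, going back to the brglm.control and bayesglm() suggestions
above: spelled out, and untested, the calls I have in mind look roughly
like the sketch below. The formula is abbreviated to a few terms purely
for illustration -- substitute your full right-hand side.)

  ## sketch only, not run: raise brglm's bias-reduction iteration limit
  library(brglm)
  Brglm_binomPres <- brglm(Presence.S ~ (CPUE.BW + Year + Quarter + Latitude)^2,
                           family = binomial, data = CPUE_table,
                           control.brglm = brglm.control(br.maxit = 1000))

  ## the bayesglm() alternative from the arm package; its default weakly
  ## informative prior may (?) be enough to get around separation
  library(arm)
  Bayes_binomPres <- bayesglm(Presence.S ~ (CPUE.BW + Year + Quarter + Latitude)^2,
                              family = binomial, data = CPUE_table)

Whichever of these fitters you end up using, though: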
I would certainly start by seeing if you can successfully fit your model
with main effects only (i.e. temporarily get rid of the ^2). I don't
think that your statements above really count the zeros and ones in the
_response_ variable -- I think you need table(CPUE_table$Presence.S).

> > * have you examined your data graphically and looked for any strong
> >   outliers that might be throwing off the fit?
>
> I did check my data graphically in a lot of different ways. However, if
> you have any particular suggestions, please let me know. Concerning
> strong outliers, I do not really understand what you mean. I have
> outliers here and there, but how can I know whether they are strong
> enough to throw off the fit? Most of the time they are really high
> abundances, coming from the fact that I'm using survey data, and
> probably related to the fact that the boat fished over a fish school.
>
> > * do you have some strongly correlated/multicollinear predictors?
>
> It's survey data, so they indeed are correlated in time and space.
> However, I checked the scatterplot matrix and I didn't notice any
> linear relation between variables.
>
> > * for what it's worth it looks like a variety of your variables
> >   might be dummy variables, which you can often express more
> >   compactly by using a factor variable and letting R construct the
> >   design matrix (i.e. generating the dummy variables on the fly),
> >   although that shouldn't change your results
>
> I will check on the dummy variable concept, as to be honest I don't
> really understand what it means...

What it means is that if you have a categorical factor with several
levels (e.g. A, B, C), the way to fit a linear model is to construct
'dummy variables' or 'contrasts', so that instead of

  var = A B C C B B A

you have three variables:

  intercept B C
          1 0 0
          1 1 0
          1 0 1
          1 0 1
          1 1 0
          1 1 0
          1 0 0
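A quick way to see R do this for itself (toy names, illustration only):

  ## let R build the dummy variables from a factor
  var <- factor(c("A", "B", "C", "C", "B", "B", "A"))
  model.matrix(~ var)

The matrix part of that output has exactly the three columns shown
above: (Intercept), varB and varC. A term like factor(Quarter) in your
glm formula would be expanded in the same way automatically.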