Re: [R] coxph won't converge when including categorical (factor) variables

Marc Schwartz Sat, 06 Jul 2013 06:48:10 -0700

On Jul 6, 2013, at 7:04 AM, E Joffe <ejo...@hotmail.com> wrote:

> Hello,
> 
> 
> 
> [rephrasing and reposting  of a previous question (that was not answered)
> with new information]
> 
> 
> 
> I have a dataset of 371 observations.
> 
> When I run coxph with numeric variables it works fine.
> 
> However, when I try to add factor (categorical) variables it returns "Ran
> out of iterations and the model did not converge"
> 
> 
> 
> Of note, when I restructure all factors to binary variables with dummy and
> use glmnet-lasso the model converges.
> 
> 
> 
> Here are examples of the code and output (including summary description of
> the variables):
> 
>> maxSTree.cox <- coxph (Surv(time,status)~Chemo_Simple, data=dataset)
> 
> 
> 
> Warning message:
> 
> In fitter(X, Y, strats, offset, init, control, weights = weights,  :
> 
>  Ran out of iterations and did not converge
> 
> 
> 
>> summary (dataset$Chemo_Simple)
> 
>                        Anthra_HDAC       Anthra_Plus       ArsenicAtra
> ATRA           ATRA_GO 
> 
>                0               163                 2                12
> 0                 2 
> 
>         ATRA_IDA Demeth_HistoneDAC          Flu_HDAC     Flu_HDAC_plus
> HDAC_Clof         HDAC_only 
> 
>                0                34                37                 4
> 24                 1 
> 
>        HDAC_Plus        LowArac       LowDAC_Clof         MYLO_IL11
> Phase1 
> 
>                4                 8                30                 5
> 5 
> 
>              SCT    StdARAC_Anthra      StdAraC_Plus          Targeted
> VNP40101M 
> 
>                0                 0                 0                13
> 23 
> 
> 
> 
> 
> 
> HELP !!!!
>



You have 371 observations, but did not indicate how many events you have in 
that dataset. A cross tabulation of the 'status' factor with Chemo_Simple is 
likely to be enlightening to get a sense of the distribution of events for each 
level of Chemo_Simple.

Chemo_Simple has a number of levels with rather low counts (ignoring the levels 
with 0's for the moment), which is likely to be a part of the problem. There is 
likely to be an issue fitting the model for some of these levels, bearing in 
mind that your "reference level" of Anthra_HDAC has a large proportion of the 
observations and with the default treatment contrasts, each of the other levels 
will be compared to it.

You should also run:

  dataset$Chemo_Simple <- factor(dataset$Chemo_Simple)

to get rid of the unused factor levels. A quick check of the coxph() code 
versus, for example, lm()/glm(), reveals that while the latter will drop unused 
factor levels when creating the internal model dataframe, the former will not. 
A check of some test output here with coxph() suggests that the unused factor 
levels will simply appear as NA's in the coxph() output without affecting the 
levels that are present, but it will be cleaner to remove them a priori.

It is also likely that you don't have enough events to handle the effective 
covariate degrees of freedom that this single factor model will have. By my 
count (the formatting above is corrupted), you have 16 non-zero factor levels, 
which would be 15 covariate degrees of freedom consumed by this single factor. 
A general rule of thumb for Cox models (and logistic regression) is to have 20 
events per covariate degree of freedom to avoid overfitting, which means that 
you would really need 300 "events". In this context, events are the smaller 
count of the two possible levels of "status". Since you only have 371 total 
observations, there is no way that you could have 300 events, since you would 
need to have at least 600 observations. Thus, it is likely that you are 
attempting to overfit the model to the data as well.

The issue of using glmnet on a matrix of separate dummy variables and it 
working is likely to be an outcome of the LASSO method penalizing/shrinking the 
factor levels that are irrelevant to have coefficients of 0. Thus, I would 
envision that the effective number of your factor levels is reduced as a 
consequence, allowing the model to be fit.

You likely need to consider collapsing some of the low count factor levels into 
an "Other" category, if it makes contextual sense to do so for your data. 

Regards,

Marc Schwartz

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] coxph won't converge when including categorical (factor) variables

Reply via email to