This appears to be a statistics, not an R-help question, so should probably be asked on a statistics list, not here (e.g. stats.stackexchange.com).
But if I understand your issue correctly, perhaps the heart f the matter is: why do you think a stable fit must explain a lot of the variation? Feel free to ignore if I I'm wrong or discuss further on the statistics list. Cheers, Bert On Sat, Sep 28, 2013 at 2:39 AM, E Joffe <ejo...@hotmail.com> wrote: > Hi all, > > I am using COX LASSO (glmnet / coxnet) regression to analyze a dataset of > 394 obs. / 268 vars. > I use the following procedure: > 1. Construct a coxnet on the entire dataset (by cv.glmnet) > 2. Pick the significant features by selecting the non-zero coefficient > under the best lambda selected by the model > 3. Build a coxph model with bi-directional stepwise feature selection > limited to the coxnet selected features. > > To validate the model I use both Brier score (library=peperr) and Harrel's > C-Index (library=Hmisc) with a bootstrap of 50 iterations. > > > MY QUESTION : I am getting fairly good C-Index (0.76) and Brier (0.07) > values for the models however per the coxnet the %Dev explained by the model > is at best 0.27 and when I plot the survfit of the coxph the plotted > confidence interval is very large. > What am I missing here ? > > %DEV=27% > > > > Brier score - 0.07 ($ipec.coxglmnet -> [1] 7.24) > C-Index - 0.76 ($cIndex -> [1] 0.763) > > > > DATA: [Private Health Information - can't publish] 394 obs./268 vars. > > CODE (need to define a dataset with 'time' and 'status' variables): > > library("survival") > library ("glmnet") > library ("c060") > library ("peperr") > library ("Hmisc") > > #creat Y (survival matrix) for glmnet > surv_obj <- Surv(dataset$time,dataset$status) > > > ## tranform categorical variables into binary variables with dummy for > dataset > predict_matrix <- model.matrix(~ ., data=dataset, > contrasts.arg = lapply > (dataset[,sapply(dataset, is.factor)], contrasts)) > > ## remove the statu/time variables from the predictor matrix (x) for > glmnet > predict_matrix <- subset (predict_matrix, select=c(-time,-status)) > > ## create a glmnet cox object using lasso regularization and cross > validation > glmnet.cv <- cv.glmnet (predict_matrix, surv_obj, family="cox") > > ## get the glmnet model on the full dataset > glmnet.obj <- glmnet.cv$glmnet.fit > > # find lambda index for the models with least partial likelihood > deviance (by cv.glmnet) > optimal.lambda <- glmnet.cv$lambda.min # For a more parsimoneous > model use lambda.1se > lambda.index <- which(glmnet.obj$lambda==optimal.lambda) > > > # take beta for optimal lambda > optimal.beta <- glmnet.obj$beta[,lambda.index] > > # find non zero beta coef > nonzero.coef <- abs(optimal.beta)>0 > selectedBeta <- optimal.beta[nonzero.coef] > > # take only covariates for which beta is not zero > selectedVar <- predict_matrix[,nonzero.coef] > > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_dataSet <- as.data.frame(cbind(surv_obj,selectedVar)) > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > > > > ##-------------------------------------------------------------------------- > ----------------------------- > ## MODEL PERFORMANCE > > ##-------------------------------------------------------------------------- > ----------------------------- > ## > > > ## Calculate the Brier score - pec does its own bootstrap so this > function runs on i=51 (i.e., whole trainset) > > ## Brier score calculation to cox-glmnet > peperr.coxglmnet <- peperr(response=surv_obj, x=selectedVarCox, > fit.fun=fit.coxph, load.all=TRUE, > indices=resample.indices(n=nrow(surv_obj), > method="boot", sample.n=50)) > > # Get error predictions from peperr > prederr.coxglmnet <- perr(peperr.coxglmnet) > > # Integrated prediction error Brier score calculation > ipec.coxglmnet<-ipec(prederr.coxglmnet, > eval.times=peperr.coxglmnet$attribute, response=surv_obj) > > > ## C-Index calculation 50 iter bootstrapping > > for (i in 1:50){ > print (paste("Iteration:",i)) > train <- sample(1:nrow(dataset), nrow(dataset), replace = TRUE) ## > random sampling with replacement > # create a dataframe for trainSet with time, status and selected > variables in binary representation for evaluation in pec > reformat_trainSet <- reformat_dataSet [train,] > > > # glmnet.cox only with meaningful features selected by stepwise > bidirectional AIC feature selection > glmnet.cox.meaningful.test <- step(coxph(Surv(time,status) ~ > .,data=reformat_dataSet),direction="both") > > selectedVarCox <- > predict_matrix[,attr(glmnet.cox.meaningful.test$terms,"term.labels")] > reformat_testSet <- as.data.frame(cbind(surv_obj,selectedVarCox)) > reformat_testSet <- reformat_dataSet [-train,] > > > # compute c-index (Harrell's) for cox-glmnet models > if (is.null(glmnet.cox.meaningful)){ > cIndexCoxglmnet <- c(cIndexCoxglmnet,NULL) > }else{ > cIndexCoxglmnet <- c(cIndexCoxglmnet, > 1-rcorr.cens(predict(glmnet.cox.meaningful, > reformat_testSet),Surv(reformat_testSet$time,reformat_testSet$status))[1]) > } > } > > #Get average C-Index > cIndex<- mean (unlist(cIndexCoxglmnet),rm.na=TRUE) > > #create a list of all the objects generated > > assign(name,c(eval(parse(text=name)),glmnet.cv=list(glmnet.cv),glmnet.obj=li > st(glmnet.obj), > > selectedVar=list(colnames(selectedVar)),glmnet.cox=list(glmnet.cox), > > glmnet.cox.meaningful=list(glmnet.cox.meaningful),ipec.coxglmnet=list(ipec.c > oxglmnet), > cIndex=cIndex)) > > # save image of the workspace after each iteration > save.image("final_subgroup_analysissubgroup_analysis.RData") > > > > ______________________________________________ > R-help@r-project.org mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. > -- Bert Gunter Genentech Nonclinical Biostatistics Internal Contact Info: Phone: 467-7374 Website: http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.