Hi,

On Aug 21, 2009, at 9:47 AM, Peter Schüffler wrote:

Hi,

Perhaps you can help me find the best lambda in a lasso model.

I have a feature selection problem with 150 proteins potentially predicting Cancer or Noncancer. With a lasso model

fit.glm <- glmpath(x=as.matrix(X), y=target, family="binomial")

(target is 0 or 1, coding cancer vs. non-cancer; X holds the proteins as numerical expression values), I get the following path (PICTURE 1). One of these models is the best according to its cross-validation (PICTURE 2); the red line corresponds to the best cross-validation error. It's produced by

cv <- cv.glmpath(x=as.matrix(X), y=unclass(T)-1, family="binomial", type="response", plot.it=TRUE, se=TRUE)
abline(v=cv$fraction[max(which(cv$cv.error==min(cv$cv.error)))], col="red", lty=2, lwd=3)


Does anyone know how to get from the norm fraction in PICTURE 2 to the corresponding model in PICTURE 1? What is the best model? Which coefficients does it have? I can only see the best model's cross-validation error, but not the actual model. How can I see it?

None of your pictures came through, so I'm not sure exactly what you're trying to point out, but in general the cross validation will help you find the best value for lambda for the lasso. I think it's the value of lambda that you'll use for your downstream analysis.

I haven't used the glmpath package, but I have been using the glmnet package, which is also by Hastie, is newer, and I believe covers the same use cases as the glmpath library (though, to be honest, I'm not quite familiar w/ the Cox proportional hazards model). Perhaps you might want to look into it.

Anyway, speaking from my experience w/ the glmnet package, you might try this:

1. Determine the best value of lambda using CV. I guess you can use MSE or R^2 as you see fit as your yardstick of "best."

2. Train a model over all of your data and ask it for the coefficients at the value of lambda found in step 1.

3. See which proteins have non-zero coefficients.

<tongue-in-cheek>
4. Divine a biological story that is explained by your statistical findings.

5. Publish.
</tongue-in-cheek>
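
Steps 1 to 3 might look roughly like the sketch below with glmnet. The data here are simulated stand-ins (the names X, target and protein1..protein150 are made up for illustration), and I'm assuming binomial deviance as the CV yardstick; swap in whatever measure you prefer.

```r
library(glmnet)

set.seed(1)
# Simulated stand-in for the real data (an assumption for illustration):
# 100 samples, 150 "proteins"; only the first three influence the outcome.
n <- 100
p <- 150
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("protein", 1:p)))
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2] + 1.5 * X[, 3]))

# Step 1: choose lambda by 10-fold cross-validation (alpha = 1 is the lasso).
cvfit <- cv.glmnet(X, target, family = "binomial", alpha = 1,
                   type.measure = "deviance", nfolds = 10)
best.lambda <- cvfit$lambda.min

# Step 2: fit on all of the data and ask for coefficients at that lambda.
fit <- glmnet(X, target, family = "binomial", alpha = 1)
beta <- as.matrix(coef(fit, s = best.lambda))  # (p + 1) x 1, incl. intercept

# Step 3: list the proteins with non-zero coefficients.
selected <- setdiff(rownames(beta)[beta[, 1] != 0], "(Intercept)")
print(selected)
```

cvfit$lambda.1se is also available if you'd rather take the sparser model within one standard error of the CV minimum.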

I guess there are many ways to do model selection, and I'm not sure it's clear how effective they are (which isn't to say that you shouldn't do them)[1] ... you might want to further divide your data into training/tuning/test sets (somewhere between steps 1 and 2) as another means of scoring models.
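
A minimal sketch of that kind of split, again on simulated data (all names are made up for illustration): hold out a test set, let cross-validation on the remaining samples do the tuning, then score the chosen model once on the held-out part. The 25% test fraction is an arbitrary assumption.

```r
library(glmnet)

set.seed(2)
# Simulated stand-in data (an assumption for illustration).
n <- 120
p <- 150
X <- matrix(rnorm(n * p), n, p)
target <- rbinom(n, 1, plogis(2 * X[, 1] - 2 * X[, 2]))

# Hold out 25% of the samples as a test set that CV never sees.
test.idx <- sample(n, size = round(0.25 * n))
train.idx <- setdiff(seq_len(n), test.idx)

# Tune lambda by cross-validation on the training samples only.
cvfit <- cv.glmnet(X[train.idx, ], target[train.idx], family = "binomial")

# Score the selected model on the untouched test samples.
pred <- predict(cvfit, newx = X[test.idx, ], s = "lambda.min", type = "class")
test.error <- mean(as.numeric(pred) != target[test.idx])
print(test.error)
```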

HTH,
-steve

[1] http://hunch.net/?p=29

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
