Claire Wooton wrote:
Steve Lianoglou <mailinglist.honeypot@gmail.com> writes:
Hi Claire,
I'm replying and CC-ing the R-help list to get more eyes on your
question, since others will likely have more/better advice, and perhaps
someone else in the future will have a similar question and might
find this thread handy.
I've removed your specific research aim since that might be private
information, but you can include it later if others find it
necessary to know in order to help.
On Apr 5, 2010, at 5:44 PM, Claire Wooton wrote:
Dear Steve,
I came across your posting on the R-help mailing list concerning finding
the best lambda in a LASSO model, and I was wondering whether you would
be able to offer any advice based on your experience.

I'm attempting to build a logistic regression model to explore [REDACTED]
and recently decided to build a LASSO model, having learned of the
problems with stepwise variable selection. While I've done a fair amount
of reading on the topic, I'm still a bit uncertain when it comes to
selecting an appropriate value for lambda when using the glmpath package.

Any advice you could offer would be much appreciated.
In general, what I've done is to use cross-validation to find this
"best" value for lambda, which I'm defining as the value of lambda
that gives me the model with the lowest "objective score" on my
testing data.

The "objective score" is in quotes because it can change with the
problem. For instance, for normal regression, the best objective score
could be the lowest mean squared error (or the highest Spearman rank
correlation) on my held-out examples. In your case, for logistic
regression, this could just be the accuracy of the predicted class labels.

So, I do the CV and get one value of lambda per fold -- the value that
returns the model with the best generalization properties on the
held-out data. After doing the 10-fold CV (once, or many times), you
could take the average value of lambda and use it for your 'downstream
analysis' by building a model on all of your data with that value of
lambda.
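
For illustration, here is a rough, untested sketch with glmnet
(cv.glmpath plays the same role for the glmpath package); the x and y
below are toy stand-ins for your data:

library(glmnet)

set.seed(1)
x <- matrix(rnorm(200 * 10), 200, 10)          # toy predictor matrix
y <- rbinom(200, 1, plogis(x[, 1] - x[, 2]))   # toy presence/absence response

## 10-fold CV, scoring each fold by misclassification error
cvfit <- cv.glmnet(x, y, family = "binomial",
                   type.measure = "class", nfolds = 10)

cvfit$lambda.min   # lambda with the lowest mean CV error
cvfit$lambda.1se   # largest lambda within 1 SE of that minimum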
I'd also do some smoke tests to see how sensitive your model is w.r.t.
the data it is trained on. Do your best lambdas vary a lot across folds?
How different is the model between folds -- are the same predictor
variables non-zero? What's their variance? Etc.
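
A quick way to eyeball that stability (again an untested sketch,
reusing the toy x and y from above):

## rerun the CV with different random fold assignments
lams <- replicate(20, cv.glmnet(x, y, family = "binomial",
                                type.measure = "class")$lambda.min)
summary(lams)   # how much does the "best" lambda move around?

## which coefficients stay non-zero at each run's best lambda?
fit <- glmnet(x, y, family = "binomial")
nz  <- sapply(lams, function(l) drop(as.matrix(coef(fit, s = l))) != 0)
rowMeans(nz)    # per-coefficient selection frequency across runs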
Also, what's your objective in building the model? Do you just want
something with high predictive accuracy? Or are you trying to draw
conclusions from the model that you build -- like infer meaning from its
coefficients?
This should probably go in the beginning of the email, but it's better
late than never:
I should add the disclaimer that I'm not a "real statistician," and
I'm "calling uncle" in advance to the card-carrying statisticians on
this list who might argue that (i) this approach isn't principled
enough, (ii) you shouldn't really take any statistical advice on a
mailing list, and (iii) you'd be best off consulting a local
statistician.
Does that answer your question? If not, could you elaborate more about
what you're after?
Please don't forget to CC the R-help list on any further communication.
Thanks,
-steve
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
Hi Steve,
Thanks very much for your reply. My main objective in building the model
is to determine the relative strength of the variables in predicting my
presence/absence data. It's really an exploratory method: I'm interested
in whether the associations that have been observed out in the field come
out in the model. I'm also using rpart to build a classification tree to
get a sense of any interactions.
rpart is not able to do that. Apparent interactions from trees are more
often than not spurious. To see this, simulate a dataset where males
have an age range of 10-90 and females a range of 40-50. You will see
splits on age for males but not for females. This has nothing to do
with interactions.
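
Something like this quick simulation sketch (the exact numbers are only
illustrative) makes the point:

library(rpart)
set.seed(2)
n <- 500
d <- data.frame(
  sex = rep(c("male", "female"), each = n),
  age = c(runif(n, 10, 90),    # males: wide age range
          runif(n, 40, 50))    # females: narrow age range
)
## the outcome depends on age only, identically for both sexes
d$y <- factor(rbinom(2 * n, 1, plogis((d$age - 50) / 10)))

fit <- rpart(y ~ sex + age, data = d)
print(fit)   # age splits tend to show up on the male side only,
             # mimicking an interaction that isn't in the model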
Frank
I was planning to use cross-validation to identify the value of lambda
that gives the minimum mean CV error, as well as the largest value of
lambda such that the error is within 1 SE of that minimum. I'm not
entirely sure how to proceed in building the full model using this value
of lambda. At this point, do I simply use predict.glmpath (or
predict.glmnet), setting the value of "s" to lambda, and return the
coefficients? I plan to validate the chosen coefficient estimates through
a bootstrap analysis.
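
For concreteness, here is a sketch of what I have in mind with glmnet
(untested; predict.glmpath with type = "coefficients" would be the
analogous route), with x and y as in the toy sketches above:

## refit on all the data, then read off coefficients at the chosen lambda
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")
fit   <- glmnet(x, y, family = "binomial")

coef(fit, s = cvfit$lambda.1se)     # coefficients at the 1-SE lambda
predict(fit, s = cvfit$lambda.min,  # equivalent route via predict()
        type = "coefficients")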
Beyond conducting this "smoke test", I'm wondering how I should assess the
resulting model. Can I assess the fit and predictive accuracy of a glmnet
object?
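
The kind of check I have in mind would be something like this (a sketch,
with the same toy x and y): hold out part of the data, predict class
labels at the chosen lambda, and tabulate the errors.

train <- sample(nrow(x), 0.8 * nrow(x))
cvfit <- cv.glmnet(x[train, ], y[train], family = "binomial",
                   type.measure = "class")
fit   <- glmnet(x[train, ], y[train], family = "binomial")

pred <- predict(fit, newx = x[-train, ],
                s = cvfit$lambda.1se, type = "class")
mean(pred == y[-train])   # held-out classification accuracy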
Thanks again for your help. I am also planning on discussing my work with
a professor in statistics, but in the meantime I appreciate the insight as
I attempt to wrap my head around these methods.
Cheers,
Claire
--
Frank E Harrell Jr
Professor and Chairman, Department of Biostatistics
School of Medicine, Vanderbilt University