You'll need to do a huge amount of background reading first. These
stepwise options do not incorporate penalization.
Frank
annie Zhang wrote:
Hi, Frank,
If I want to do prediction as well as to select important predictors,
which may be the best function to use when I have 35 samples and 35
predictors (penalized logistic with variable selection)? I saw there is
a 'fastbw' function in the Design package. And there is a 'step.plr'
function in the 'stepPlr' package.
Thank you,
Annie
On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
<f.harr...@vanderbilt.edu> wrote:
annie Zhang wrote:
Thank you for all your replies.
Actually, as Bert said, besides prediction, I also need variable
selection (I need to know which variables are important). As for
the sample size and number of variables, both are small, around 35.
How can I get accurate prediction as well as good predictors?
Annie
It is next to impossible to find a unique list of 'important'
variables without having 50 times as many subjects as potential
predictors, unless your signal:noise ratio is stunning.
Frank
On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
<gunter.ber...@gene.com> wrote:
But let's be clear here folks:

Ben's comment is apropos: "'As many variables as samples' is
particularly scary."

(Aside -- how much scarier then are -omics analyses in which the
number of variables is thousands of times the number of samples?)

Sensible penalization (it's usually not too sensitive to the details)
is only another way of obtaining a parsimonious model with good (in the
sense of minimizing overall prediction error: bias + variance)
prediction properties. Alas, this is often not what scientists want:
they use variable selection to find the "right" covariates, the "most
important" variables affecting the response. But this is beyond the
power of empirical modeling here: "as many variables as samples" almost
guarantees that there will be many different and even nonoverlapping
subsets of variables that are, within statistical noise, equally
"optimal" predictors. That is, variable selection in such circumstances
is just a pretty sophisticated random number generator -- ergo Frank's
Draconian warnings.

Penalization produces better prediction engines with better properties,
but it cannot overcome the "as many variables as samples" problem
either. Entropy rules. If what is sought is a way to determine the
"truly important" variables, then the study must be designed to provide
the information to do so. You don't get something for nothing.
Cheers,
Bert Gunter
Genentech Nonclinical Biostatistics
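
The instability Bert describes can be seen in a small simulation. What follows is only a rough sketch in base R, not anything posted in the thread: the data are simulated, the effect sizes are an assumption, and the "selection" rule (keep the k predictors most correlated with the response) is a deliberately simple stand-in for stepwise or any other selection method. The names select_top, B, and k are made up for the illustration.

```r
set.seed(1)
n <- 35; p <- 35; k <- 5            # 35 samples, 35 predictors, keep 5

# Simulated data (assumption): only predictors 1-3 carry any signal
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(drop(x[, 1:3] %*% c(1, 1, 1))))

# Stand-in selection rule: the k predictors with the largest
# absolute correlation with the response
select_top <- function(x, y, k) {
  score <- abs(cor(x, y))
  order(score, decreasing = TRUE)[seq_len(k)]
}

# Repeat the selection on bootstrap resamples of the same 35 subjects
B <- 200
sel <- replicate(B, {
  i <- sample(n, replace = TRUE)
  select_top(x[i, , drop = FALSE], y[i], k)
})

# Fraction of resamples in which each predictor was selected
freq <- table(factor(sel, levels = seq_len(p))) / B
print(round(sort(freq, decreasing = TRUE)[1:10], 2))
```

If the selected set were stable, a handful of predictors would be chosen in nearly every resample and the rest almost never. How spread out the frequencies are in a given run depends on the seed and the assumed effect sizes, so treat the printout as an illustration of the idea rather than a result.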
-----Original Message-----
From: r-help-boun...@r-project.org
[mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
Sent: Wednesday, September 02, 2009 9:07 PM
To: annie Zhang
Cc: r-help@r-project.org
Subject: Re: [R] variable selection in logistic
annie Zhang wrote:
> Hi, Frank,
>
> You mean the backward and forward stepwise selection is bad? You also
> suggest the penalized logistic regression is the best choice? Is there
> any function to do it as well as selecting the best penalty?
>
> Annie
All variable selection is bad unless it's in the context of
penalization. You'll need penalized logistic regression, not
necessarily with variable selection: for example, a quadratic penalty
as in a case study in my book, or an L1 penalty (lasso) using other
packages.
Frank
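
To make "quadratic penalty" concrete, here is a minimal hand-rolled sketch in base R of logistic regression with a ridge (L2) penalty, fitted by minimizing the penalized negative log-likelihood with optim(). It is not the code from Frank's book and not what the Design package does internally. The function name ridge_logistic, the simulated data, and the two lambda values are assumptions made for illustration, and the penalty value is simply fixed here rather than chosen by any criterion.

```r
set.seed(2)
n <- 35; p <- 35

# Simulated data (assumption): two predictors with real effects
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# Hypothetical helper: logistic regression with a quadratic penalty
# lambda/2 * sum(beta_j^2) on the slopes; the intercept is unpenalized
ridge_logistic <- function(x, y, lambda) {
  X <- cbind(1, scale(x))                  # standardize predictors
  pen <- c(0, rep(lambda, ncol(x)))        # no penalty on the intercept
  negpll <- function(b) {
    eta <- drop(X %*% b)
    -sum(y * eta - log1p(exp(eta))) + 0.5 * sum(pen * b^2)
  }
  grad <- function(b) {
    mu <- plogis(drop(X %*% b))
    -drop(crossprod(X, y - mu)) + pen * b
  }
  optim(rep(0, ncol(X)), negpll, grad, method = "BFGS",
        control = list(maxit = 500))$par
}

b_weak   <- ridge_logistic(x, y, lambda = 0.1)   # light shrinkage
b_strong <- ridge_logistic(x, y, lambda = 10)    # heavy shrinkage

# Size of the slope vector under each penalty
print(c(weak   = sqrt(sum(b_weak[-1]^2)),
        strong = sqrt(sum(b_strong[-1]^2))))
```

Note that all 35 slopes stay in the model under either penalty; they are shrunk toward zero rather than dropped, which is the sense in which penalization here does not involve variable selection. An L1 penalty would set some slopes exactly to zero, but that needs a different optimizer and is not shown.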
>
> On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
> <f.harr...@vanderbilt.edu> wrote:
>
> David Winsemius wrote:
>
>
> On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>
> Hi, R users,
>
> What may be the best function in R to do variable selection
> in logistic regression?
>
>
> PhD theses, and books by famous statisticians have been pursuing
> the answer to that question for decades.
>
> I have the same number of variables as the number of samples,
> and I want to select the best variables for prediction. Is
> there any function doing forward selection followed by backward
> elimination in stepwise logistic regression?
>
>
> You should probably be reading up on penalized regression
> methods. The stepwise procedures reporting unadjusted
> "significance" made available by SAS and SPSS to the unwary
> neophyte user have very poor statistical properties.
>
> --
>
> David Winsemius, MD
>
>
> Amen to that.
>
> Annie, resist the temptation. These methods bite.
>
> Frank
>
>
> Heritage Laboratories
> West Hartford, CT
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Frank E Harrell Jr   Professor and Chair   School of Medicine
> Department of Biostatistics   Vanderbilt University
>
>