Re: [R] variable selection in logistic

annie Zhang Thu, 03 Sep 2009 14:47:00 -0700

Thank you for all your suggestions. I will start with the chapter.

Annie


On Thu, Sep 3, 2009 at 1:50 PM, Don McKenzie <d...@u.washington.edu> wrote:

> Frank may be too modest to suggest it, but a great place to start that
> reading is in his book "Regression Modeling Strategies"  chapter 4.
>   On Sep 3, 2009, at 1:45 PM, Frank E Harrell Jr wrote:
>
>   You'll need to do a huge amount of background reading first.  These
> stepwise options do not incorporate penalization.
>
> Frank
>
> annie Zhang wrote:
>
> Hi, Frank,
>  If I want to do prediction as well as to select important predictors,
> which may be the best function to use when I have 35 samples and 35
> predictors (penalized logistic with variable selection)? I saw there is a
> 'fastbw' function in the Design package. And there is a 'step.plr' function
> in the 'stepPlr' package.
>  Thank you,
>  Annie
> On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr
> <f.harr...@vanderbilt.edu> wrote:
>     annie Zhang wrote:
>         Thank you for all your replies.
>         Actually, as Bert said, besides prediction I also need variable
>         selection (I need to know which variables are important). As for
>         the sample size and number of variables, both are small, around
>         35. How can I get accurate prediction as well as good
>         predictors?
>         Annie
>     It is next to impossible to find a unique list of 'important'
>     variables without having 50 times as many subjects as potential
>     predictors, unless your signal:noise ratio is stunning.
>     Frank
>         On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
>         <gunter.ber...@gene.com> wrote:
>            But let's be clear here folks:
>            Ben's comment is apropos: "'As many variables as samples' is
>            particularly scary."
>            (Aside -- how much scarier then are -omics analyses in which
>            the number of variables is thousands of times the number of
>            samples?)
>            Sensible penalization (it's usually not too sensitive to the
>            details) is only another way of obtaining a parsimonious model
>            with good (in the sense of minimizing overall prediction
>            error: bias + variance) prediction properties. Alas, this is
>            often not what scientists want: they use variable selection to
>            find the "right" covariates, the "most important" variables
>            affecting the response. But this is beyond the power of
>            empirical modeling here: "as many variables as samples" almost
>            guarantees that there will be many different and even
>            nonoverlapping subsets of variables that are, within
>            statistical noise, equally "optimal" predictors. That is,
>            variable selection in such circumstances is just a pretty
>            sophisticated random number generator -- ergo Frank's
>            Draconian warnings. Penalization produces better prediction
>            engines with better properties, but it cannot overcome the "as
>            many variables as samples" problem either. Entropy rules. If
>            what is sought is a way to determine the "truly important"
>            variables, then the study must be designed to provide the
>            information to do so. You don't get something for nothing.
>            Cheers,
>            Bert Gunter
>            Genentech Nonclinical Biostatistics
>            -----Original Message-----
>            From: r-help-boun...@r-project.org
>            [mailto:r-help-boun...@r-project.org] On Behalf Of
>            Frank E Harrell Jr
>            Sent: Wednesday, September 02, 2009 9:07 PM
>            To: annie Zhang
>            Cc: r-help@r-project.org
>            Subject: Re: [R] variable selection in logistic
>            annie Zhang wrote:
>             > Hi, Frank,
>             >
>             > You mean the backward and forward stepwise selection is bad?
>             > You also suggest the penalized logistic regression is the
>             > best choice? Is there any function to do it as well as
>             > selecting the best penalty?
>             >
>             > Annie
>            All variable selection is bad unless it's in the context of
>            penalization. You'll need penalized logistic regression, not
>            necessarily with variable selection: for example, a quadratic
>            penalty as in a case study in my book, or an L1 penalty (lasso)
>            using other packages.
>            Frank
>             >
>             > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
>             > <f.harr...@vanderbilt.edu> wrote:
>             >
>             >     David Winsemius wrote:
>             >
>             >
>             >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
>             >
>             >             Hi, R users,
>             >
>             >             What may be the best function in R to do
>             >             variable selection in logistic regression?
>             >
>             >
>             >         PhD theses and books by famous statisticians have
>             >         been pursuing the answer to that question for
>             >         decades.
>             >
>             >             I have the same number of variables as the
>             >             number of samples, and I want to select the best
>             >             variables for prediction. Is there any function
>             >             doing forward selection followed by backward
>             >             elimination in stepwise logistic regression?
>             >
>             >
>             >         You should probably be reading up on penalized
>             >         regression methods. The stepwise procedures reporting
>             >         unadjusted "significance" made available by SAS and
>             >         SPSS to the unwary neophyte user have very poor
>             >         statistical properties.
>             >
>             >         --
>             >
>             >         David Winsemius, MD
>             >
>             >
>             >     Amen to that.
>             >
>             >     Annie, resist the temptation.  These methods bite.
>             >
>             >     Frank
>             >
>             >
>             >         Heritage Laboratories
>             >         West Hartford, CT
>             >
>             >         ______________________________________________
>             >         R-help@r-project.org mailing list
>             >         https://stat.ethz.ch/mailman/listinfo/r-help
>             >         PLEASE do read the posting guide
>             >         http://www.R-project.org/posting-guide.html
>             >         and provide commented, minimal, self-contained,
>             >         reproducible code.
>             >
>             >
>             >
>             >     --
>             >     Frank E Harrell Jr   Professor and Chair           School of Medicine
>             >                          Department of Biostatistics   Vanderbilt University
>             >
>             >
> Don McKenzie
> Research Ecologist
> Pacific Wildland Fire Sciences Lab
> US Forest Service
>
> Affiliate Professor
> College of Forest Resources and CSES Climate Impacts Group
> University of Washington
>
> phone: 206-732-7824
> cell: 206-321-5966
> d...@u.washington.edu
>
>
>
>
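
Bert Gunter's remark in the quoted thread, that variable selection with as
many variables as samples is "just a pretty sophisticated random number
generator", can be illustrated with a small simulation. What follows is only
a minimal sketch under stated assumptions, written in Python with the
standard library rather than R so that it runs without extra packages. It
assumes pure-noise data (35 samples, 35 Gaussian predictors, a coin-flip
outcome), and it swaps stepwise logistic regression for a much cruder rule:
pick the single predictor with the largest absolute correlation with the
outcome. The function names (abs_corr, top_predictor, selection_counts) and
all parameter values are hypothetical, chosen for illustration.

```python
import random
import statistics
from collections import Counter


def abs_corr(x, y):
    """Absolute Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column (or outcome) carries no information
    return abs(sxy) / (sxx * syy) ** 0.5


def top_predictor(X, y):
    """Index of the column of X with the largest absolute correlation with y."""
    p = len(X[0])
    scores = [abs_corr([row[j] for row in X], y) for j in range(p)]
    return max(range(p), key=scores.__getitem__)


def selection_counts(n=35, p=35, n_boot=200, seed=1):
    """Count how often each noise predictor 'wins' across bootstrap resamples."""
    rng = random.Random(seed)
    # Assumption: pure noise, so no predictor is truly related to the outcome.
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [rng.randint(0, 1) for _ in range(n)]
    picks = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # rows drawn with replacement
        picks.append(top_predictor([X[i] for i in idx], [y[i] for i in idx]))
    return Counter(picks)


if __name__ == "__main__":
    counts = selection_counts()
    print("distinct 'best' predictors:", len(counts))
    print("most frequent picks:", counts.most_common(5))
```

If selection were stable, one index would win nearly every resample. On noise
data of this shape the winners would be expected to spread over many indices,
which is the instability described in the thread. The same resampling check
could in principle be wrapped around any selection procedure.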


