Re: [R] variable selection in logistic

Frank E Harrell Jr Thu, 03 Sep 2009 13:46:02 -0700

You'll need to do a huge amount of background reading first. These stepwise options do not incorporate penalization.


Frank


annie Zhang wrote:

Hi, Frank,

If I want to do prediction as well as to select important predictors,
which may be the best function to use when I have 35 samples and 35
predictors (penalized logistic with variable selection)? I saw there is
a 'fastbw' function in the Design package. And there is a 'step.plr'
function in the 'stepPlr' package.

Thank you,
Annie

On Thu, Sep 3, 2009 at 10:11 AM, Frank E Harrell Jr <f.harr...@vanderbilt.edu> wrote:


    annie Zhang wrote:

        Thank you for all your replies.
        Actually, as Bert said, besides prediction, I also need variable
        selection (I need to know which variables are important). As for
        the sample size and number of variables, both are small, around
        35. How can I get accurate prediction as well as good predictors?
        Annie


    It is next to impossible to find a unique list of 'important'
    variables without having 50 times as many subjects as potential
    predictors, unless your signal:noise ratio is stunning.

    Frank


        On Thu, Sep 3, 2009 at 8:28 AM, Bert Gunter
        <gunter.ber...@gene.com> wrote:

           But let's be clear here folks:

           Ben's comment is apropos: "'As many variables as samples' is
           particularly scary."

           (Aside -- how much scarier then are -omics analyses in which the
           number of variables is thousands of times the number of samples?)

           Sensible penalization (it's usually not too sensitive to the
           details) is only another way of obtaining a parsimonious model
           with good (in the sense of minimizing overall prediction error:
           bias + variance) prediction properties. Alas, this is often not
           what scientists want: they use variable selection to find the
           "right" covariates, the "most important" variables affecting the
           response. But this is beyond the power of empirical modeling
           here: "as many variables as samples" almost guarantees that there
           will be many different and even nonoverlapping subsets of
           variables that are, within statistical noise, equally "optimal"
           predictors. That is, variable selection in such circumstances is
           just a pretty sophisticated random number generator -- ergo
           Frank's Draconian warnings. Penalization produces better
           prediction engines with better properties, but it cannot overcome
           the "as many variables as samples" problem either. Entropy rules.
           If what is sought is a way to determine the "truly important"
           variables, then the study must be designed to provide the
           information to do so. You don't get something for nothing.
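
A rough way to see the instability described above is to resample a small data set and repeat the selection each time. The sketch below is an illustration only, not code from this thread: the data are simulated, the names `top_k`, `picks`, `freq`, and `n_subsets` are made up for the example, and simple correlation screening stands in for a stepwise procedure.

```r
set.seed(42)
n <- 35; p <- 35      # as many variables as samples
n_boot <- 200         # number of bootstrap resamples (arbitrary choice)
k <- 5                # size of the "selected" subset (arbitrary choice)

# Simulated data (assumption): only the first three predictors carry signal
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(x[, 1] + x[, 2] + x[, 3]))

# Stand-in selection rule: keep the k predictors most correlated with y
top_k <- function(x, y, k) {
  order(abs(cor(x, y)), decreasing = TRUE)[seq_len(k)]
}

# Repeat the selection on bootstrap resamples of the same 35 subjects
picks <- replicate(n_boot, {
  idx <- sample(n, replace = TRUE)
  top_k(x[idx, ], y[idx], k)
})

# Share of resamples in which each predictor was selected
freq <- tabulate(picks, nbins = p) / n_boot
print(round(sort(freq, decreasing = TRUE)[1:10], 2))

# Number of distinct subsets chosen across the resamples
n_subsets <- length(unique(apply(picks, 2,
                                 function(s) paste(sort(s), collapse = "-"))))
print(n_subsets)
```

If the selected subset changes from one resample to the next, and the selection frequencies are spread over many predictors, the data cannot support a unique list of "important" variables.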

           Cheers,

           Bert Gunter
           Genentech Nonclinical Biostatistics


           -----Original Message-----
           From: r-help-boun...@r-project.org
           [mailto:r-help-boun...@r-project.org] On Behalf Of Frank E Harrell Jr
           Sent: Wednesday, September 02, 2009 9:07 PM
           To: annie Zhang
           Cc: r-help@r-project.org
           Subject: Re: [R] variable selection in logistic

           annie Zhang wrote:
            > Hi, Frank,
            >
            > You mean the backward and forward stepwise selection is bad? You also
            > suggest the penalized logistic regression is the best choice? Is there
            > any function to do it as well as selecting the best penalty?
            >
            > Annie

           All variable selection is bad unless it's in the context of
           penalization. You'll need penalized logistic regression, not
           necessarily with variable selection: for example, a quadratic
           penalty as in a case study in my book, or an L1 penalty (lasso)
           using other packages.
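
For readers who want to see what a quadratic penalty amounts to, here is a minimal base-R sketch. It is not the case study from the book and not the interface of any package: the data are simulated, `pen_nll`, `pen_grad`, and `fit_ridge` are hypothetical helper names, and the two penalty values are arbitrary rather than tuned.

```r
set.seed(1)
n <- 35; p <- 35

# Simulated data (assumption): two predictors carry signal, the rest are noise
x <- scale(matrix(rnorm(n * p), n, p))
y <- rbinom(n, 1, plogis(x[, 1] - x[, 2]))

# Negative log-likelihood plus quadratic penalty; the intercept is unpenalized
pen_nll <- function(beta, x, y, lambda) {
  eta <- beta[1] + drop(x %*% beta[-1])
  -sum(y * eta - log1p(exp(eta))) + 0.5 * lambda * sum(beta[-1]^2)
}

# Analytic gradient of pen_nll with respect to (intercept, slopes)
pen_grad <- function(beta, x, y, lambda) {
  eta <- beta[1] + drop(x %*% beta[-1])
  r <- plogis(eta) - y
  c(sum(r), drop(crossprod(x, r)) + lambda * beta[-1])
}

# Minimize the penalized criterion for a given lambda, starting from zero
fit_ridge <- function(x, y, lambda) {
  optim(rep(0, ncol(x) + 1), pen_nll, pen_grad, x = x, y = y,
        lambda = lambda, method = "BFGS", control = list(maxit = 1000))$par
}

b_weak   <- fit_ridge(x, y, lambda = 1)
b_strong <- fit_ridge(x, y, lambda = 20)

# A stronger penalty shrinks the slopes toward zero; none is dropped outright
print(c(weak = sqrt(sum(b_weak[-1]^2)), strong = sqrt(sum(b_strong[-1]^2))))
```

All 35 predictors stay in the model here; the penalty controls how far their coefficients are shrunk. In practice the penalty would be chosen by a criterion such as cross-validation or an effective-AIC search, which this sketch leaves out.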

           Frank

            >
            > On Wed, Sep 2, 2009 at 7:41 PM, Frank E Harrell Jr
            > <f.harr...@vanderbilt.edu> wrote:
            >
            >     David Winsemius wrote:
            >
            >
            >         On Sep 2, 2009, at 9:36 PM, annie Zhang wrote:
            >
            >             Hi, R users,
            >
            >             What may be the best function in R to do variable
            >             selection in logistic regression?
            >
            >
            >         PhD theses and books by famous statisticians have been
            >         pursuing the answer to that question for decades.
            >
            >             I have the same number of variables as the number of
            >             samples, and I want to select the best variables for
            >             prediction. Is there any function doing forward
            >             selection followed by backward elimination in stepwise
            >             logistic regression?
            >
            >
            >         You should probably be reading up on penalized regression
            >         methods. The stepwise procedures reporting unadjusted
            >         "significance" made available by SAS and SPSS to the unwary
            >         neophyte user have very poor statistical properties.
            >
            >         --
            >
            >         David Winsemius, MD
            >
            >
            >     Amen to that.
            >
            >     Annie, resist the temptation.  These methods bite.
            >
            >     Frank
            >
            >
            >         Heritage Laboratories
            >         West Hartford, CT
            >
            >         ______________________________________________
            >         R-help@r-project.org mailing list
            >         https://stat.ethz.ch/mailman/listinfo/r-help
            >         PLEASE do read the posting guide
            >         http://www.R-project.org/posting-guide.html
            >         and provide commented, minimal, self-contained, reproducible code.
            >
            >
            >
            >     --
            >     Frank E Harrell Jr   Professor and Chair           School of Medicine
            >                          Department of Biostatistics   Vanderbilt University
            >
            >


           --
           Frank E Harrell Jr   Professor and Chair           School of Medicine
                                Department of Biostatistics   Vanderbilt University


--
Frank E Harrell Jr   Professor and Chair           School of Medicine
                     Department of Biostatistics   Vanderbilt University

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
