From: SR Millis <srmil...@yahoo.com>
To: Jin Minming <jminm...@yahoo.com> 
Sent: Monday, January 30, 2012 9:25 AM
Subject: Re: [R] Variable selection based on both training and testing data
 

Jim,

First, stepwise methods for variable selection should be avoided.  Frank 
Harrell (in Regression Modeling Strategies) discusses this at length.

Second, splitting a dataset into training and validation sets is generally not 
a good idea unless you have a very large sample, e.g., > 20,000 observations.  
As Harrell has discussed, split-sample validation does not provide external 
validation, is terribly inefficient, and is arbitrary.  It's better to specify 
your model a priori and use the bootstrap to estimate your model's optimism, 
i.e., how much its apparent performance overstates its performance on new 
data.  Bootstrapping can be implemented with Harrell's rms package in R.
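
A minimal sketch of that workflow, assuming simulated data and hypothetical 
variable names (d, y, x1, x2, x3) that are not from the original question.  It 
fits a pre-specified linear model with rms::ols() and asks rms::validate() for 
bootstrap estimates of optimism; the choice of B = 200 resamples is 
illustrative only.

```r
library(rms)

# Simulated data, for illustration only (not from the original post).
set.seed(1)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 - 0.3 * d$x2 + rnorm(n)

# Model specified a priori, with no stepwise selection.
# x = TRUE, y = TRUE store the design matrix and response in the fit,
# which validate() needs in order to refit on bootstrap resamples.
fit <- ols(y ~ x1 + x2 + x3, data = d, x = TRUE, y = TRUE)

# Bootstrap validation: reports apparent indexes, estimated optimism,
# and optimism-corrected indexes (e.g. R-square, MSE).
val <- validate(fit, B = 200)
print(val)
```

The "optimism" column is the estimated amount by which each apparent index 
is too favourable, and "index.corrected" is the apparent value minus that 
optimism.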

Scott
 
~~~~~~~~~~~
Scott R Millis, PhD, ABPP, CStat, PStat®
Professor
Wayne State University School of Medicine
Email:  aa3...@wayne.edu
Email:  srmil...@yahoo.com
Tel: 313-993-8085


________________________________

To: r-help@r-project.org 
Sent: Monday, January 30, 2012 8:14 AM
Subject: [R] Variable selection based on both training and testing data

Dear all,

Variable selection in regression is usually done on the training data alone, 
using AIC or an F test, e.g. with stepAIC. Is there an R package that can 
consider both the training and test datasets? For example, I have separate 
training and test data. First, a regression model is fitted to the training 
data, and then this model is evaluated on the test data. This process is 
repeated in order to find candidate optimal models in terms of RMSE or R2 
for both the training and test data. 

Thanks,

Jim

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.