Let me give an outline of how to answer Alfredo's question via an example.
I will split the data set "lung" into two pieces.  For these subjects with
advanced lung cancer the physician's assessment of ECOG performance status
(ph.ecog) is one of the most powerful indicators of outcome.  Try to
predict it from other variables.

library(survival)   # for the test data set
library(rpart)

data1 <- lung[1:125,]
data2 <- lung[126:228,]

rfit1 <- rpart(ph.ecog ~ ., data=data1)
printcp(rfit1)

        CP nsplit rel error  xerror     xstd
1 0.565788      0   1.00000 1.04037 0.100516
2 0.098516      1   0.43421 0.44906 0.045001
3 0.042708      2   0.33570 0.35134 0.041692
4 0.031032      3   0.29299 0.37610 0.042971
5 0.019949      4   0.26196 0.37753 0.044692
6 0.010000      5   0.24201 0.39166 0.050332

# Validate using data2.  First get predictions for each of the pruned trees
cpvalues <- rfit1$cptable[, "CP"]
pmat <- matrix(0, nrow(data2), length(cpvalues))
for (i in seq_along(cpvalues))
    pmat[, i] <- predict(prune(rfit1, cp = cpvalues[i]), newdata = data2)

Now we need to decide on a measure of error.  Try simple squared error.

error <- colMeans((data2$ph.ecog - pmat)^2)
round(error, 3)
[1] 0.493 0.280 0.210 0.225 0.186 0.198
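
If we want to carry this through to the pruning step described in my reply
below (compute the error of each candidate tree on the validation data, find
the best one, and prune there), a minimal continuation, with object names of
my own choosing, would be:

best <- which.min(error)    # smallest validation error, here the 4-split tree
rfit1.best <- prune(rfit1, cp = cpvalues[best])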

This is simple, but other cases are more complex.  The performance score is
actually an integer from 0-4 (5 = dead); see
  http://ecog-acrin.org/resources/ecog-performance-status

table(lung$ph.ecog)
  0   1   2   3
 63 113  50   1

What if we instead fit a model that treats the response as categorical?
The total number of nested models turns out to be a bit smaller.

rfit2 <- rpart(ph.ecog ~ ., data=data1, method="class")

printcp(rfit2)
       CP nsplit rel error  xerror     xstd
1 0.35938      0   1.00000 1.00000 0.086951
2 0.12500      1   0.64062 0.64062 0.081854
3 0.06250      2   0.51562 0.70312 0.083662
4 0.03125      4   0.39062 0.57812 0.079610
5 0.01000      5   0.35938 0.56250 0.078977

 predict(rfit2, newdata=data2)[1:5,]
          0      1       2 3
126 0.03125 0.9375 0.03125 0
127 0.03125 0.9375 0.03125 0
128 0.03125 0.9375 0.03125 0
129 0.03125 0.9375 0.03125 0
130 0.37500 0.6250 0.00000 0

Now we can ask either for the predicted probability of each class (the
default), which is a vector of length 4 for each subject, or for the
predicted class, which is a single value.  Which do we want, and then what
is the best measure of prediction error?
If three subjects with true value 0 had predicted probability vectors of
(.8, .2, 0, 0), (.8, .1, .1, 0) and (.45, .25, .2, .1), one point of view
would say they are all the same (all three pick 0 as the most likely class),
while another would give them different errors.  Is the second prediction
worse than the first?
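
To make the tradeoff concrete, here is one sketch (only one of many possible
scoring rules, and the object names are my own) that scores the validation
data both ways: a misclassification rate based on the predicted class, and a
Brier-type squared error based on the class probabilities.

pclass <- predict(rfit2, newdata=data2, type="class") # single predicted class
pprob  <- predict(rfit2, newdata=data2)               # class probability matrix

ok    <- !is.na(data2$ph.ecog)              # guard against a missing response
truth <- as.character(data2$ph.ecog[ok])

# misclassification rate: every wrong class counts the same
miss <- mean(as.character(pclass[ok]) != truth)

# Brier-type score: squared distance from the probability vector to the truth
ymat  <- outer(truth, colnames(pprob), "==") + 0      # 0/1 indicator matrix
brier <- mean(rowSums((pprob[ok, ] - ymat)^2))

For the three example vectors above the misclassification rate calls them
identical, a Brier score ranks the second as best, and a log-likelihood
(deviance) score would tie the first two; which behavior you want is exactly
the judgment call.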

What if the single subject with ph.ecog=3 had ended up in the validation data
set; how should we judge their prediction?

This complexity is one reason that there is not a simple function for
"validation" with a new data set.



On 02/27/2017 09:48 AM, Alfredo wrote:
Thank you, Terry, for your answer.

I'll try to explain my question better.  When you create a classification
or regression tree you first grow a tree based on a splitting criterion:
this usually results in a large tree that provides a good fit to the
training data.  The problem with this tree is its potential for overfitting
the data: the tree can be tailored too specifically to the training data
and not generalize well to new data.  The solution (apart from
cross-validation) is to find a smaller subtree that results in a low error
rate on *holdout or validation data*.

Hope this helps to clarify my question.

Best,

Alfredo

-----Original Message-----
From: Therneau, Terry M., Ph.D. [mailto:thern...@mayo.edu]

You will need to give more detail of exactly what you mean by "prune using
a validation set".  The prune.rpart function will prune at any value you
want; what I suspect you are looking for is to compute the error of each
possible tree using a validation data set, then find the best one, and then
prune there.

How do you define "best"?

