Hi All,

This is my first time seeking help from the R forum (and also my first formal
data-mining analysis). I searched the archives a bit but didn't find responses
that fully address my question. Any comments would be highly appreciated.

I am using the rpart function to analyze factors that might contribute to a
heightened injury rate; the outcome is a continuous variable. After fitting the
initial tree and pruning it, the final tree has five terminal nodes, with the
cross-validation errors shown below:

        CP nsplit rel error xerror    xstd
1 0.139141      0   1.00000 1.0033 0.26163
2 0.128314      1   0.86086 1.2752 0.28481
3 0.036021      3   0.60423 1.4315 0.29652
4 0.022675      4   0.56821 1.5142 0.29749
5 0.020000      5   0.54554 1.4615 0.28818

My questions are:

(1) Is this pruned tree even valid? The cross-validation error is exceedingly
high, well above 1.00.

(2) What contributes to the high cross-validation errors (xerror), and why does
it go up and then come back down a little?

My guess is that the data are quite noisy, so the splitting was based pretty
much on random noise, resulting in poor prediction. I've found that it helps a
bit to increase the minimum number of data points required at splitting and
terminal nodes (inevitably at the expense of the tree's overall R-squared), but
the problem lingers.
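For reference, here is a minimal, self-contained sketch of how I tightened those limits via rpart.control() and inspected the xerror curve with printcp()/plotcp(); the dataset (car.test.frame, which ships with the rpart package) and the specific limit values are illustrative only, not my actual data:

```r
library(rpart)  # assumes the rpart package (shipped with R) is available

# Illustrative control settings: raise the minimum node sizes so splits
# need more support, and grow the full tree (cp = 0) for inspection.
ctrl <- rpart.control(minsplit = 20, minbucket = 7, cp = 0, xval = 10)

# car.test.frame is an example dataset bundled with rpart; Mileage is
# a continuous outcome, analogous to the injury-rate outcome here.
fit <- rpart(Mileage ~ ., data = car.test.frame,
             method = "anova", control = ctrl)

printcp(fit)  # rel error vs. xerror across cp values
plotcp(fit)   # visual check: does xerror ever dip meaningfully below 1?
```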

Any thoughts?


The initial tree command and the last six rows of the output are
tree1 <- rpart(overexertion ~ ., method = "anova", data = data, xval = 10,
               minbucket = 4, minsplit = 10, cp = 0)
Root node error: 502364/347 = 1447.7
n=347 (179 observations deleted due to missingness)

> tree1$cptable[dim(tree1$cptable)[1] - 5:0, ]
             CP nsplit rel error   xerror      xstd
43 9.769565e-05     54 0.2926771 1.641099 0.2626767
44 5.053530e-05     55 0.2925794 1.640780 0.2626771
45 4.314452e-05     56 0.2925288 1.640926 0.2626727
46 2.960797e-05     57 0.2924857 1.640925 0.2626727
47 1.570814e-05     58 0.2924561 1.640941 0.2626724
48 0.000000e+00     59 0.2924404 1.640906 0.2626728

The pruning command is
fit9 <- prune(tree1, cp = .02)  # set the cost-complexity parameter at 0.02
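As a sanity check on the choice cp = 0.02, the 1-SE rule can be applied directly to the printed cptable; the sketch below simply re-enters the five rows shown above as a data frame, so it reflects that output rather than any new computation:

```r
# cptable of the pruned tree, copied from the printout above.
cptab <- data.frame(
  CP     = c(0.139141, 0.128314, 0.036021, 0.022675, 0.020000),
  nsplit = c(0, 1, 3, 4, 5),
  xerror = c(1.0033, 1.2752, 1.4315, 1.5142, 1.4615),
  xstd   = c(0.26163, 0.28481, 0.29652, 0.29749, 0.28818)
)

best   <- which.min(cptab$xerror)            # row with minimum xerror
thresh <- cptab$xerror[best] + cptab$xstd[best]
onese  <- which(cptab$xerror <= thresh)[1]   # smallest tree within 1 SE
cptab$CP[onese]  # with these numbers: 0.139141, i.e. prune back to the root
```

With these values the 1-SE rule selects the root node (no splits at all), which seems consistent with my worry that the splits carry no predictive signal.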




Tin-chi Lin
Liberty Mutual Research Institute for Safety
71 Frankland Rd, Hopkinton, MA 01748

Ext: 732-7466
Phone: (508)4970266
Email: tin-chi....@libertymutual.com







______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
