Hi, I am using the packages tree and rpart to build a classification tree to predict a 0/1 outcome. The package rpart has the advantage that the function plotcp gives a visual representation of the cross-validation results with a horizontal line indicating the 1 standard error rule, i.e. the recommendation to select the most parsimonious model (the smallest tree) whose error is not more than one standard error above the error of the best model.
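To make the rule concrete, here is a minimal sketch of how I read it off an rpart fit. The kyphosis data (shipped with rpart) only stands in for my own data, and the control settings and the seed are arbitrary choices for illustration, not recommendations.

```r
library(rpart)

# kyphosis ships with rpart; it is only a stand-in for my own 0/1 data
set.seed(1)  # arbitrary seed, the cross-validation folds depend on it
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class",
             control = rpart.control(cp = 0, minsplit = 5, xval = 10))

# Columns of the table: CP, nsplit, rel error, xerror, xstd
cpt <- fit$cptable

# 1 standard error rule: the threshold is the smallest cross-validated
# error plus the standard error of that same row
best <- which.min(cpt[, "xerror"])
threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]

# Rows run from the smallest tree to the largest, so the first row
# at or below the threshold is the most parsimonious candidate
chosen <- min(which(cpt[, "xerror"] <= threshold))
chosen_size <- unname(cpt[chosen, "nsplit"]) + 1  # number of leaves

# Prune the full tree back to the chosen complexity parameter
pruned <- prune(fit, cp = cpt[chosen, "CP"])
```

Calling plotcp(fit) on the same object draws this threshold as the horizontal line.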
However, rpart does not give me trees of every size. In one example I am working with, I get only three sizes, 1, 2 and 5, whereas cv.tree in the tree package gives 1, 2, 3, 4 and 5, which is what I would expect, since weakest-link pruning successively collapses the internal node that contributes the least. What is the reason for this difference?

A second problem, in both packages, is that the cross-validation results vary a lot between runs. Some variation is expected, since cross-validation randomly divides the dataset into 10 subsets of roughly equal size, and this can be done in many different ways. One hopes that the results do not depend much on the particular split, but I have found that they often do. Should one then repeat the cross-validation many times, e.g. 100, select a model with the 1 standard error rule each time, and in the end count which model was selected most often? Or rather repeat it many times and average the means and standard errors of the prediction error? Or does very high variability in the cross-validation results mean that the dataset is too small to reach conclusions?

Kind regards and thanks for your help,
Tom
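
P.S. In case it helps, this is a rough sketch of the two repetition strategies I have in mind, again with kyphosis as a stand-in for my data. The helper names one_run and select_1se are my own, and the 100 repetitions and control settings are arbitrary. It relies on my understanding that only the xerror and xstd columns of the cptable change between runs, so the rows can be compared across runs.

```r
library(rpart)

# One cross-validated fit; only the random fold assignment differs per seed
one_run <- function(seed) {
  set.seed(seed)
  fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
               method = "class",
               control = rpart.control(cp = 0, minsplit = 5, xval = 10))
  fit$cptable
}

# Tree size (number of leaves) picked by the 1 standard error rule
select_1se <- function(cpt) {
  best <- which.min(cpt[, "xerror"])
  threshold <- cpt[best, "xerror"] + cpt[best, "xstd"]
  row <- min(which(cpt[, "xerror"] <= threshold))
  unname(cpt[row, "nsplit"]) + 1
}

runs <- lapply(1:100, one_run)

# Strategy 1: apply the rule in every run and count the selected sizes
sizes <- sapply(runs, select_1se)
size_counts <- table(sizes)

# Strategy 2: average xerror and xstd over the runs, then apply the rule once
xerr <- do.call(cbind, lapply(runs, function(cpt) cpt[, "xerror"]))
xsd  <- do.call(cbind, lapply(runs, function(cpt) cpt[, "xstd"]))
averaged <- runs[[1]]
averaged[, "xerror"] <- rowMeans(xerr)
averaged[, "xstd"]   <- rowMeans(xsd)
size_from_average <- select_1se(averaged)
```

Whether simply averaging the per-run standard errors is statistically sensible is part of what I am asking.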