--- included message ----
Thus, my question is: *What common measures exist for ranking/measuring the importance of the variables participating in a CART model? And how can this be computed using R (for example, when using the rpart package)?*
---end ----

Consider the following printout from rpart:

    summary(rpart(time ~ age + ph.ecog + pat.karno, data=lung))

    Node number 1: 228 observations,    complexity param=0.03665178
      mean=305.2325, MSE=44176.93
      left son=2 (81 obs) right son=3 (147 obs)
      Primary splits:
          pat.karno < 75   to the left,  improve=0.03661157, (3 missing)
          ph.ecog   < 1.5  to the right, improve=0.03620793, (1 missing)
          age       < 75.5 to the right, improve=0.01606491, (0 missing)
      Surrogate splits:
          ph.ecog < 1.5  to the right, agree=0.787, adj=0.392, (3 split)
          age     < 72.5 to the right, agree=0.680, adj=0.089, (0 split)

In Breiman, Friedman, Olshen, & Stone, the canonical CART book,
   the pat.karno variable would get .0366 "points" for this split,
   ph.ecog would get .0366 * .392 points,
   age would get .0366 * .089 points.

The reason for adding in surrogates is to account for redundant variables. Suppose, for instance, that x1 is height and so is x10, just measured on a different day. They won't be exactly the same, so one will get picked over the other at any given split; but in the end they should get the same importance score.

This calculation is added up over all the splits to get a variable importance.

So, all the necessary ingredients are present. Someone just needs to write the importance function :-)

Terry T.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
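
The scoring rule described above can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the numbers are typed in by hand from the node 1 printout rather than extracted from the fitted object, and the helper names split_points() and importance_from_points() are hypothetical, not part of rpart.

```r
# Hypothetical helper (not an rpart function): points for a single split.
# The primary variable gets the full improvement; each surrogate gets the
# improvement multiplied by its adjusted agreement ("adj" in the printout).
split_points <- function(improve, primary, surrogate_adj) {
  pts <- c(improve, improve * surrogate_adj)
  names(pts) <- c(primary, names(surrogate_adj))
  pts
}

# Node 1, with values copied by hand from the summary() printout.
node1 <- split_points(
  improve       = 0.03661157,
  primary       = "pat.karno",
  surrogate_adj = c(ph.ecog = 0.392, age = 0.089)
)

# Hypothetical helper: add the points up over all splits, by variable name.
importance_from_points <- function(points_list) {
  all_pts <- unlist(unname(points_list))
  totals  <- vapply(split(all_pts, names(all_pts)), sum, numeric(1))
  sort(totals, decreasing = TRUE)
}

# Only one node is shown in the printout, so the list has a single entry;
# a full tree would contribute one entry per non-leaf node.
imp <- importance_from_points(list(node1))
print(imp)
```

For a real tree the per-node values would have to be pulled from the fitted object instead of typed in. Recent rpart releases also store a variable.importance component on the fit, which is probably the easier route when it is available.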