On Tue, 2007-12-18 at 16:27 -0600, Naiara Pinto wrote:
> Dear all,
>
> I would like to use a tree regression method to analyze my dataset. I
> am interested in the fact that random forests creates in-bag and
> out-of-bag datasets, but I also need an estimate of support for each
> split. That seems hard to do in random forests since each tree is
> grown using a subset of the predictor variables.
>
> I was thinking of setting mtry = number of predictor variables,
> growing several trees, and computing the support for each node as the
> number of times that a certain predictor variable was chosen for that
> node. Can this be implemented using random forests?

Hi Naiara,

I'm no expert here, but what you propose, setting mtry = number of
predictors, gives you the procedure known as bagging.

You talk about support for the split and then for the node. Is this
just a typo, or are you interested in two different things? I'm not
aware of how you would do the latter in bagging or random forests, as
the whole point is to grow large trees, not pruned ones.

As to the former, trees are unstable: change the data used to train
them just a little and you can get a very different fitted tree.
Bagging and random forests exploit this to produce a better prediction
machine / classifier by using n poor trees rather than one best tree.
They do this by adding randomness to the procedure: bootstrap sampling
the training data and, in the case of random forests, randomly
sampling a small number, mtry, of the available predictors as
candidates at each split. As such there is no correspondence between
the splits of one tree and the splits of another, so there is little
point in trying to count how many times a certain split in one or more
trees is formed by the same predictor. It doesn't make sense (to me;
it may to others) to focus on individual splits in the n trees.

I don't know what you mean exactly by "support", but if you are trying
to get a measure of how important each of your predictors is in
explaining variance in your response, then take a look at the
importance() function in the randomForest package. This produces a
couple of measures that allow you to determine which predictors
contribute most to reducing node impurity or MSE.

HTH

G

>
> Thanks!
>
> Naiara.
>
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Dr. Gavin Simpson             [t] +44 (0)20 7679 0522
 ECRC, UCL Geography,          [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.
[w] http://www.freshwaters.org.uk %~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~% ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
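
P.S. A minimal sketch of the two suggestions in the reply above, in case it helps. It is not from the original thread: the built-in airquality data, the choice of Ozone as response, ntree = 200 and the object names (aq, bag, imp, root_var, root_counts) are all assumptions made purely for illustration. It fits a bagged ensemble by setting mtry to the number of predictors, prints the two importance() measures, and then tabulates which predictor forms the root split of each tree. The root is the only node that is comparable across trees, and even there the chosen predictor can change from one bootstrap sample to the next.

```r
library(randomForest)

set.seed(42)

# Illustrative data only: built-in airquality, rows with missing values dropped
aq <- na.omit(airquality)
x  <- aq[, c("Solar.R", "Wind", "Temp", "Month", "Day")]
y  <- aq$Ozone
p  <- ncol(x)

# mtry = p: every predictor is a candidate at every split, which is bagging
bag <- randomForest(x, y, ntree = 200, mtry = p, importance = TRUE)

# Two importance measures for regression:
#   %IncMSE       - increase in out-of-bag MSE when the predictor is permuted
#   IncNodePurity - total decrease in node impurity from splits on the predictor
imp <- importance(bag)
print(imp)

# Predictor used for the root split (row 1, column 3 of getTree()) in each tree
root_var <- vapply(seq_len(bag$ntree), function(k) {
  as.character(getTree(bag, k = k, labelVar = TRUE)[1, 3])
}, character(1))

# How often each predictor was chosen at the root across the bootstrap samples
root_counts <- table(factor(root_var, levels = colnames(x)))
print(root_counts)
```

Below the root, node k in one tree and node k in another cover different subsets of the data, so counts like these stop being interpretable, which is why the importance() measures are probably the more useful summary.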