Thank you very much for this suggestion; I was not aware of this package. Apart from this, is suggestion 2 (raising the nodesize argument) a good way to go? Experimenting with sampsize (suggestion 4) has yielded promising results.

Kind regards,
Andreas Béguin

2010/5/24 Gabor Grothendieck <ggrothendi...@gmail.com>

> You could also try the Boruta package for variable selection.
>
> 2010/5/24 Andreas Béguin <chaud...@gmail.com>:
> > Dear R-help list members,
> >
> > I have a statistical question regarding the Random Forest function (RF)
> > as applied to ecological prediction of species presences and absences.
> >
> > RF seems to perform very well for prediction of species ranges or
> > prevalences. However, the problem with my dataset is a high degree of
> > spatial autocorrelation and therefore a low effective sample size
> > compared to the full number of gridpoints (0.5 degree grid extending
> > over all land areas north of 55 deg. south, ~60000 grid points). My
> > variables are highly correlated in the x and y directions. When using
> > the entire dataset in the RF function, the misclassification rate is
> > unbelievably low, suggesting overfitting. The noisy marginal probability
> > plots (see attached example) somewhat support this idea. My question is:
> > Is there a way to make the decision trees in RF more generalizable
> > without modelling the spatial autocorrelation explicitly? Here are four
> > ways of doing this that I have thought about:
> >
> > 1. Spatially clustering observations into training and test datasets and
> > averaging the predicted class probability values to approximate "real"
> > certainty - This could be done on country level or in a chessboard-like
> > pattern
> > 2. Requiring a higher minimal nodesize to prevent the creation of
> > overfitted, maximal trees - Which value of "nodesize" might be
> > appropriate?
> > 3. Reducing the number of variables involved in the model by just taking
> > one out of a group of correlated variables (say, for example, only
> > winter temperature instead of temperatures from all seasons) - This
> > variable selection would be based on the Variable Importance plots. I
> > was considering using the Gini measure ranking instead of the accuracy
> > ranking to produce simpler, more "biological" trees; please comment on
> > this.
> > 4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
> > ("presence"-"absence") observations using the "sampsize" option, thereby
> > increasing the distance between the gridpoints chosen to build the model
> > so as to reduce correlation between observations.
> >
> > Which of these pathways would you suggest pursuing? Certainly some of
> > you have faced and tackled the problem of spatial autocorrelation in
> > ecological prediction. I am aware of the works of Araujo et al. (2005)
> > and Koenig (1999); any further suggested reading (especially examples of
> > how spatial autocorrelation can be dealt with practically) would be
> > highly welcome.
> >
> > Kind regards,
> >
> > Andreas Beguin
> > ##########################################
> > Division of Epidemiology and Global Health
> > Department of Public Health and Clinical Medicine
> > Umea University
> > 907 31 Umea Sweden

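For readers who want to try the chessboard-like split from suggestion 1, here is a minimal sketch of how gridpoints could be assigned to two spatially blocked folds. It is written in Python rather than R purely for illustration. The function names (`chessboard_fold`, `split_by_chessboard`), the 10-degree block size and the toy grid are all assumptions, not anything taken from the thread or from the randomForest package.

```python
import math

def chessboard_fold(lon, lat, block_deg=10.0):
    """Return 0 or 1: the colour of the chessboard square containing the point.

    block_deg is a hypothetical block width in degrees; it should be chosen
    larger than the range of the spatial autocorrelation in the data.
    """
    col = math.floor(lon / block_deg)
    row = math.floor(lat / block_deg)
    return (col + row) % 2

def split_by_chessboard(points, block_deg=10.0):
    """Split (lon, lat) points into two spatially blocked folds."""
    folds = {0: [], 1: []}
    for lon, lat in points:
        folds[chessboard_fold(lon, lat, block_deg)].append((lon, lat))
    return folds[0], folds[1]

# Toy 0.5-degree grid over a 40 x 40 degree window (not the real study area).
grid = [(x * 0.5, y * 0.5) for x in range(80) for y in range(80)]
fold_a, fold_b = split_by_chessboard(grid, block_deg=10.0)
```

One would then fit the model on the rows in one fold, predict on the other, swap the roles, and average the predicted class probabilities. Because neighbouring gridpoints mostly fall into the same block, the held-out error is less flattered by spatial autocorrelation than a random split would be.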
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.