You could also try the Boruta package for variable selection.

2010/5/24 Andreas Béguin <chaud...@gmail.com>:
> Dear R-help list members,
>
> I have a statistical question regarding the Random Forest function (RF) as
> applied to ecological prediction of species presences and absences.
>
> RF seems to perform very well for prediction of species ranges or
> prevalences. However, the problem with my dataset is a high degree of
> spatial autocorrelation and therefore a low effective sample size compared
> to the full number of gridpoints (0.5 degree grid extending over all land
> areas north of 55 deg. south, ~60000 grid points). My variables are to a
> high degree correlated in x and y direction. When using the entire dataset
> in the RF function, the misclassification rate is unbelievably low,
> suggesting overfitting. The noisy marginal probability plots (see attached
> example) somehow support this idea. My question is: Is there a way to make
> the decision trees in RF more generalizable without modelling the spatial
> autocorrelation explicitly? Here are four ways of doing this I have thought
> about:
>
> 1. Spatially clustering observations into training and test datasets and
> averaging the predicted class probability values to approximate "real"
> certainty - This could be done on country level or in a chessboard-like
> pattern
>
> 2. Requiring a higher minimal nodesize to prevent the creation of
> overfitted, maximal trees - Which value of "nodesize" might be appropriate?
>
> 3. Reducing the number of variables involved in the model by just taking one
> out of a group of correlated variables (say, for example, only winter
> temperature instead of temperatures from all seasons) - This variable
> selection would be based on the Variable Importance plots. I was considering
> to use the Gini measure ranking instead of the accuracy ranking to produce
> simpler, more "biological" trees, please comment on this.
>
> 4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
> ("presence"-"absence") observations using the "sampsize" option, thereby
> increasing the distance between the gridpoints chosen to build the model so
> as to reduce correlation between observations.
>
> Which of these pathways would you suggest to pursue? Certainly some of you
> have faced and tackled the problem of spatial autocorrelation in ecological
> prediction. I am aware of the works of Araujo et al. (2005) and Koenig
> (1999), any further suggested reading (especially examples of how spatial
> autocorrelation can be dealt with practically) would be highly welcome.
>
> Kind regards,
>
> Andreas Beguin
> ##########################################
> Division of Epidemiology and Global Health
> Department of Public Health and Clinical Medicine
> Umea University
> 907 31 Umea Sweden
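
On option 1 in the quoted message (the chessboard-like pattern), here is a minimal sketch of how grid points might be assigned to training and test blocks. It is written in plain Python only to show the logic, which is a few lines in R as well. The function names and the 10-degree block size are hypothetical choices, not something from the original post. In practice the block edge would probably need to be larger than the range of the spatial autocorrelation in the data.

```python
import math


def checkerboard_fold(lon, lat, block_deg=10.0):
    """Return 0 or 1: the chessboard colour of the block containing the point.

    block_deg is an assumed block edge length in degrees (hypothetical default).
    """
    # Shift coordinates so that block indices are non-negative.
    col = math.floor((lon + 180.0) / block_deg)
    row = math.floor((lat + 90.0) / block_deg)
    # Neighbouring blocks get alternating colours, like a chessboard.
    return (row + col) % 2


def split_checkerboard(points, block_deg=10.0):
    """Split (lon, lat) points into index lists for training and testing.

    Points in colour-0 blocks go to training, colour-1 blocks to testing.
    """
    train_idx, test_idx = [], []
    for i, (lon, lat) in enumerate(points):
        if checkerboard_fold(lon, lat, block_deg) == 0:
            train_idx.append(i)
        else:
            test_idx.append(i)
    return train_idx, test_idx


if __name__ == "__main__":
    # A small illustrative 0.5-degree grid, not real species data.
    grid = [(x * 0.5, y * 0.5) for x in range(0, 80) for y in range(0, 80)]
    train_idx, test_idx = split_checkerboard(grid, block_deg=10.0)
    print(len(train_idx), len(test_idx))
```

Swapping the roles of the two colours and averaging the two sets of predicted class probabilities would correspond to the averaging described in option 1. Whether that averaged error is a fair estimate still depends on the blocks being large enough to be roughly independent.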