Thank you very much for this suggestion; I was not aware of this package.
Apart from this, is suggestion 2 (changing the nodesize argument) a good way
to go? Experimenting with sampsize (suggestion 4) has yielded promising
results.
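
To make clear what I mean by suggestion 4: as I understand it, giving
sampsize one value per class makes each tree grow on a fixed number of
presences and absences instead of the full, densely gridded dataset. The
idea can be sketched in a language-neutral way as below. This is only an
illustration in plain Python, not the randomForest internals; the function
name balanced_sample, the toy labels and the choice to draw without
replacement are my own assumptions.

```python
import random
from collections import defaultdict


def balanced_sample(labels, n_per_class, rng):
    """Return indices holding n_per_class observations from every class.

    Hypothetical helper: draws without replacement within each class.
    """
    by_class = defaultdict(list)
    for i, lab in enumerate(labels):
        by_class[lab].append(i)

    chosen = []
    for lab in sorted(by_class):
        idx = by_class[lab]
        if len(idx) < n_per_class:
            raise ValueError(f"class {lab!r} has only {len(idx)} observations")
        chosen.extend(rng.sample(idx, n_per_class))
    return chosen


# Toy data: 50 presences and 950 absences (made-up numbers).
labels = [True] * 50 + [False] * 950
rng = random.Random(1)

# One such subset would be drawn afresh for every tree in the forest.
subset = balanced_sample(labels, 20, rng)
```

Because each tree then sees only a small fraction of the grid points, the
points used for any one tree tend to lie further apart, which is the effect
I was hoping for.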

Kind regards,
Andreas Béguin

2010/5/24 Gabor Grothendieck <ggrothendi...@gmail.com>

> You could also try the Boruta package for variable selection.
>
> 2010/5/24 Andreas Béguin <chaud...@gmail.com>:
> > Dear R-help list members,
> >
> > I have a statistical question regarding the Random Forest function (RF)
> > as applied to ecological prediction of species presences and absences.
> >
> > RF seems to perform very well for prediction of species ranges or
> > prevalences. However, the problem with my dataset is a high degree of
> > spatial autocorrelation and therefore a low effective sample size
> > compared to the full number of grid points (0.5 degree grid extending
> > over all land areas north of 55 deg. south, ~60000 grid points). My
> > variables are highly correlated in the x and y directions. When using the
> > entire dataset in the RF function, the misclassification rate is
> > unbelievably low, suggesting overfitting. The noisy marginal probability
> > plots (see attached example) somewhat support this idea. My question is:
> > Is there a way to make the decision trees in RF more generalizable
> > without modelling the spatial autocorrelation explicitly? Here are four
> > ways of doing this I have thought about:
> > 1. Spatially clustering observations into training and test datasets and
> > averaging the predicted class probability values to approximate "real"
> > certainty - This could be done at country level or in a chessboard-like
> > pattern.
> > 2. Requiring a higher minimal nodesize to prevent the creation of
> > overfitted, maximal trees - Which value of "nodesize" might be
> > appropriate?
> > 3. Reducing the number of variables involved in the model by taking just
> > one out of each group of correlated variables (say, for example, only
> > winter temperature instead of temperatures from all seasons) - This
> > variable selection would be based on the Variable Importance plots. I was
> > considering using the Gini measure ranking instead of the accuracy
> > ranking to produce simpler, more "biological" trees; please comment on
> > this.
> > 4. Requiring RF to choose only a certain number of "TRUE" and "FALSE"
> > ("presence"-"absence") observations using the "sampsize" option, thereby
> > increasing the distance between the grid points chosen to build the
> > model so as to reduce correlation between observations.
> >
> > Which of these pathways would you suggest pursuing? Certainly some of
> > you have faced and tackled the problem of spatial autocorrelation in
> > ecological prediction. I am aware of the works of Araujo et al. (2005)
> > and Koenig (1999); any further suggested reading (especially examples of
> > how spatial autocorrelation can be dealt with practically) would be
> > highly welcome.
> >
> > Kind regards,
> >
> > Andreas Beguin
> > ##########################################
> > Division of Epidemiology and Global Health
> > Department of Public Health and Clinical Medicine
> > Umea University
> > 907 31 Umea Sweden
>
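
Regarding the chessboard-like pattern in suggestion 1 of the quoted message,
the assignment of grid points to alternating blocks could look roughly like
the sketch below. It is plain Python and only illustrates the idea; the
function name chessboard_fold, the 10 degree block size and the example
coordinates are assumptions chosen for illustration. In practice the block
size would presumably have to be at least as large as the range of the
spatial autocorrelation.

```python
import math


def chessboard_fold(lon, lat, block_deg=10.0):
    """Return 0 or 1 according to the 'colour' of the block containing a point.

    Blocks are block_deg x block_deg squares in longitude/latitude;
    neighbouring blocks always receive different fold numbers.
    """
    col = math.floor(lon / block_deg)
    row = math.floor(lat / block_deg)
    return (col + row) % 2


# Made-up grid-point coordinates (longitude, latitude) on a 0.5 degree grid.
points = [(-179.75, 60.25), (-170.25, 60.25), (-169.75, 60.25), (10.25, -54.75)]
folds = [chessboard_fold(lon, lat) for lon, lat in points]

# Fold 0 can be used for training and fold 1 for testing, then the roles
# swapped and the predicted class probabilities averaged.
train = [p for p, f in zip(points, folds) if f == 0]
test = [p for p, f in zip(points, folds) if f == 1]
```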


______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.