Dear all,

I would like to use the function randomForest() to predict the probability 
of relocation failure of a GPS collar as a function of several 
environmental variables x (both factor and numeric: slope, vegetation, 
etc.) on a given area. The response variable y is thus success 
(0)/failure (1) of the relocation, and the sampling unit is the pixel of 
a raster map. My aim is to build a map predicting the probability that a 
relocation will fail, P(y=1|x), at each point. I am tempted to use the 
function predict.randomForest() to estimate this probability (with 
type="prob").

If I understand correctly, this function returns the proportion of trees 
in the random forest voting in favour of the success or failure of the 
relocation. In the appendix of the paper cited as reference on the help 
page of the function randomForest() (Breiman, 2001. Random Forests), 
Breiman notes that these proportions of votes can be interpreted as the 
probability, calculated over all trees, that a tree, given the variables 
x and the training set, would classify a relocation as success or 
failure (using Breiman's notation, P_\Theta(h(x, \Theta) = failure) for 
the failure class). I have found several threads on R-help related to 
predict.randomForest(..., type="prob") that confirm this interpretation 
of these probabilities (e.g., 
http://r.789695.n4.nabble.com/quot-prob-quot-in-predict-randomForest-td887278.html,
http://r.789695.n4.nabble.com/Random-Forest-AUC-td3006649.html).
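
To make this interpretation concrete, here is a minimal sketch on simulated 
data. The data frame, the coefficients and the variable names are all made up 
for illustration and have nothing to do with my real data. It compares the 
output of predict(..., type="prob") with the proportion of trees voting 
"failure", computed by hand from the individual tree predictions returned 
with predict.all=TRUE. As far as I understand the package, the two should 
agree:

```r
library(randomForest)

set.seed(1)
n <- 500

# Hypothetical environmental variables (illustration only)
dat <- data.frame(
  slope      = runif(n, 0, 45),
  vegetation = factor(sample(c("open", "shrub", "forest"), n, replace = TRUE))
)

# Assumed "true" failure probability, used only to simulate the response
p_true <- plogis(-2 + 0.05 * dat$slope + 1.0 * (dat$vegetation == "forest"))
dat$y  <- factor(ifelse(rbinom(n, 1, p_true) == 1, "failure", "success"))

rf <- randomForest(y ~ slope + vegetation, data = dat, ntree = 500)

newx <- dat[1:10, c("slope", "vegetation")]

# Proportion of votes for "failure" as returned by the package
p_vote <- as.numeric(predict(rf, newdata = newx, type = "prob")[, "failure"])

# Same quantity computed by hand: one column of class labels per tree
all_trees <- predict(rf, newdata = newx, predict.all = TRUE)$individual
p_manual  <- as.numeric(rowMeans(all_trees == "failure"))

# Compare the two vectors
same <- isTRUE(all.equal(p_vote, p_manual))
print(same)
```

Note that newdata is passed explicitly: without it, predict() returns the 
out-of-bag predictions, which are based on a different subset of trees for 
each observation.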

However, I would like to know under which conditions (assumptions about 
the process, parameters of the random forest, etc.) it is correct to use 
this proportion of votes as an estimate of the "true" probability 
P(failure | environment) characterizing the relocation process. I 
searched the web and the literature, but I did not find any reference 
describing how these two probabilities are connected. Breiman 
(2002; Manual On Setting Up, Using, And Understanding Random Forests 
V3.1) only notes that the proportion of votes "should not be interpreted 
as the underlying distributional probabilities".
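
One way I could look at this question empirically is sketched below, again 
on simulated data where the true probability is known by construction (the 
model, the coefficients and the bin width are arbitrary assumptions of 
mine). The forest is fitted on one half of the data, and on the other half 
the proportion of votes is compared, bin by bin, with the true probability 
and with the observed frequency of failures. This only shows what happens in 
one artificial setting and does not tell me under which general assumptions 
the approximation holds:

```r
library(randomForest)

set.seed(42)
n <- 2000

# Simulated process with a known failure probability (assumption)
sim <- data.frame(slope = runif(n, 0, 45))
sim$p_true <- plogis(-3 + 0.1 * sim$slope)
sim$y <- factor(ifelse(rbinom(n, 1, sim$p_true) == 1, "failure", "success"))

train <- sim[1:1000, ]
test  <- sim[1001:2000, ]

rf <- randomForest(y ~ slope, data = train, ntree = 500)

# Proportion of votes for "failure" on data not used for fitting
p_vote <- as.numeric(predict(rf, newdata = test, type = "prob")[, "failure"])

# Group the vote proportions into bins and summarise each bin
bins <- cut(p_vote, breaks = seq(0, 1, by = 0.2), include.lowest = TRUE)
calib <- data.frame(
  bin       = levels(bins),
  mean_vote = as.vector(tapply(p_vote, bins, mean)),
  mean_true = as.vector(tapply(test$p_true, bins, mean)),
  obs_freq  = as.vector(tapply(test$y == "failure", bins, mean)),
  n         = as.vector(table(bins))
)
print(calib)
```

If the proportion of votes were a good estimate of P(failure | environment), 
mean_vote, mean_true and obs_freq should be close to each other in every 
bin; empty bins appear as NA.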

Could you point me toward some references about this problem, or give me 
ideas of the assumptions under which this approximation would be correct?
Thanks for any hints!
Best regards,

Clément Calenge
 > version
                _
platform       i686-pc-linux-gnu
arch           i686
os             linux-gnu
system         i686, linux-gnu
status         Under development (unstable)
major          2
minor          13.0
year           2011
month          02
day            06
svn rev        54234
language       R
version.string R version 2.13.0 Under development (unstable) (2011-02-06 
r54234)

-- 
Clément CALENGE
Cellule d'appui à l'analyse de données
Direction des Etudes et de la Recherche
Office national de la chasse et de la faune sauvage
Saint Benoist - 78610 Auffargis
tel. (33) 01.30.46.54.14



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.