Hi all,

I apologize in advance if I am missing something very simple here, but since
I could not resolve this myself, I'm sending this question to the list.

I would appreciate any help in understanding exactly how the rpart function
computes "improve" (reported in fit$splits), and how it differs between the
split='information' and split='gini' parameters.

According to the help page for rpart.object:
"improve, which is the improvement in deviance given by this split"
From what I understand, that would mean the "improve" value should not
change with the "split" setting.  Since it does change, I suspect it
reflects the impurity measure somehow, but I can't work out exactly how.

Below is some simple R code showing the result for a simple classification
tree, with what the function outputs and what I would have expected to see
if "improve" simply reflected the change in impurity.


library(rpart)
set.seed(1324)
y <- sample(c(0, 1), 20, replace = TRUE)
x <- y
x[1:5] <- 0  # make x an imperfect copy of y
fit <- rpart(y ~ x, method = "class", parms = list(split = 'information'))
fit$splits[, "improve"] # why is improve here 6.84 ?
fit <- rpart(y ~ x, method = "class", parms = list(split = 'gini'))
fit$splits[, "improve"] # why is improve here 5.38 ?


# Here is what I thought it should have been:
# impurity functions for "information" (entropy) and "gini"
entropy <- function(p) {
  # the early return works for the case when y has only 0 and 1 categories
  if (any(p == 1)) return(0)
  -sum(p * log(p, 2))
}
gini <- function(p) sum(p * (1 - p))

obs_1 <- y[x>.5]
obs_0 <- y[x<.5]
n_l <- sum(x<.5)  # size of the node holding obs_0, to match impurity_l
n_R <- sum(x>.5)  # size of the node holding obs_1, to match impurity_R
n <- length(x)

# for entropy (information)
impurity_root <- entropy(prop.table(table(y)))
impurity_l <- entropy(prop.table(table(obs_0)))
impurity_R <- entropy(prop.table(table(obs_1)))
# shouldn't this have been "improve" ??
impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.4934

# for "gini"
impurity_root <- gini(prop.table(table(y)))
impurity_l <- gini(prop.table(table(obs_0)))
impurity_R <- gini(prop.table(table(obs_1)))
impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.2692
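
The quantity I expected could also be wrapped in a small helper, so that each
node's weight is always taken from the same observations as its impurity.
This is only a sketch of my own expectation, not of what rpart does
internally: impurity_decrease, gini_index and the toy data are names and
values made up for illustration.

```r
# Hypothetical helper (not part of rpart): root impurity minus the
# size-weighted impurity of the two child nodes of a binary split.
impurity_decrease <- function(y, go_left, impurity) {
  class_props <- function(v) as.numeric(prop.table(table(v)))
  n       <- length(y)
  n_left  <- sum(go_left)
  n_right <- n - n_left
  impurity(class_props(y)) -
    (n_left  / n) * impurity(class_props(y[go_left])) -
    (n_right / n) * impurity(class_props(y[!go_left]))
}

# Gini index of a vector of class proportions
gini_index <- function(p) sum(p * (1 - p))

# Toy data (made up): the left node is mixed, the right node is pure
toy_y    <- c(0, 1, 1, 1)
toy_left <- c(TRUE, TRUE, FALSE, FALSE)
delta <- impurity_decrease(toy_y, toy_left, gini_index)
delta # 0.125 = 0.375 - (0.5 * 0.5 + 0.5 * 0)
```

Any function of the class proportions, such as the entropy, can be passed as
the impurity argument in the same way.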


Thanks in advance,
Tal


