Re: [R] In rpart, how is "improve" calculated? (in the "class" case)

Tal Galili Tue, 14 Jun 2011 13:23:13 -0700

Hi Ed,
Thank you for the reply!

Professor Atkinson already gave me that answer by pointing me to the technical
report of rpart that describes this:
*http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf*<http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf>


However, I was also only able to reproduce the "gini" impurity, and not the
"information" one.
I hope either Professor Atkinson or some other member of the list could help
out with this.

In the meantime, I also found a bug in the code I sent to the mailing list,
bellow is the fixed code (also more organized):


 # creating data
set.seed(1324)
y <- sample(c(0,1), 20, T)
x <- y
x[1:5] <- 0

# manually making the first split
obs_L <- y[x<.5]
obs_R <- y[x>.5]
n_L <- sum(x<.5)
n_R <- sum(x>.5)
n <- length(x)


calc.impurity <- function(func = gini)
{
impurity_root <- func(prop.table(table(y)))
 impurity_L <- func(prop.table(table(obs_L)))
 impurity_R <-func(prop.table(table(obs_R)))
 imp <- impurity_root - ((n_L/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757
 imp*n
}

# for "gini"
require(rpart)
fit <- rpart(y~x, method = "class", parms=list(split='gini'))
fit$split[,3] # 5.384615
gini <- function(p) {sum(p*(1-p))}
calc.impurity(gini) # 5.384615 # success!


# for "information" I fail...

fit <- rpart(y~x, method = "class", parms=list(split='information'))
fit$split[,3] # why is improve here 6.84029 ?

entropy <- function(p) {
if(any(p==1)) return(0) # works for the case when y has only 0 and 1
categories...
 -sum(p*log(p))
 }
calc.impurity(entropy) # 9.247559 != 6.84029





----------------Contact
Details:-------------------------------------------------------
Contact me: tal.gal...@gmail.com |  972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------




On Tue, Jun 14, 2011 at 10:56 PM, Ed Merkle <merk...@missouri.edu> wrote:

> Tal,
>
> For the Gini criterion, the "improve" value can be calculated as a weighted
> sum of the improvement in impurity.  Continuing with your original code:
>
>
> # for "gini"
> impurity_root<- gini(prop.table(table(y)))
> impurity_l<- gini(prop.table(table(obs_0)))
> impurity_R<-gini(prop.table(table(obs_1)))
>
> # (13 and 7 are sample sizes in respective nodes)
> 13*(impurity_root - impurity_l) + 7*(impurity_root - impurity_R)
> [1] 5.384615
>
> This does not appear to extend immediately to the information criterion,
> however.  I'm not sure about the 6.84.
>
> Ed
>
>
> On 6/14/11 5:00 AM, r-help-requ...@r-project.org wrote:
>
>> ------------------------------
>>
>> Message: 4
>> Date: Mon, 13 Jun 2011 15:47:26 +0300
>> From: Tal Galili<tal.gal...@gmail.com>
>> To:r-help@r-project.org
>> Subject: [R] In rpart, how is "improve" calculated? (in the "class"
>>         case)
>> Message-ID:<BANLkTimp1aFQoYrKina7H0Rnk=0zkr_...@mail.gmail.com>
>> Content-Type: text/plain
>>
>>
>> Hi all,
>>
>> I apologies in advance if I am missing something very simple here, but
>> since
>> I failed at resolving this myself, I'm sending this question to the list.
>>
>> I would appreciate any help in understanding how the rpart function is
>> (exactly) computing the "improve" (which is given in fit$split), and how
>> it
>> differs when using the split='information' vs split='gini' parameters.
>>
>> According to the help in rpart.object:
>> "improve, which is the improvement in deviance given by this split"
>>
>>> From what I understand, that would mean that the "improve" value should
>>> not
>>>
>> be different when using different "split" switches.  Since it is
>> different,
>> then I suspect that it is reflecting  the impurity measure somehow, but I
>> can't seem to understand how exactly.
>>
>> Bellow is some simple R code showing the result for a simple
>> classification
>> tree, with what the function outputs, and what I would have expected to
>> see
>> if "improve" were to simply reflect the change in impurity.
>>
>>
>> set.seed(1324)
>> y<- sample(c(0,1), 20, T)
>> x<- y
>> x[1:5]<- 0
>> require(rpart)
>> fit<- rpart(y~x, method = "class", parms=list(split='information'))
>> fit$split[,3] # why is improve here 6.84 ?
>> fit<- rpart(y~x, method = "class", parms=list(split='gini'))
>> fit$split[,3] # why is improve here 5.38 ?
>>
>>
>> # Here is what I thought it should have been:
>> # for "information"
>> entropy<- function(p) {
>> if(any(p==1)) return(0) # works for the case when y has only 0 and 1
>> categories...
>>  -sum(p*log(p,2))
>> }
>> gini<- function(p) {sum(p*(1-p))}
>>
>> obs_1<- y[x>.5]
>> obs_0<- y[x<.5]
>> n_l<- sum(x>.5)
>> n_R<- sum(x<.5)
>> n<- length(x)
>>
>> # for entropy (information)
>> impurity_root<- entropy(prop.table(table(y)))
>> impurity_l<- entropy(prop.table(table(obs_0)))
>> impurity_R<-entropy(prop.table(table(obs_1)))
>> # shouldn't this have been "improve" ??
>> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.7272
>>
>> # for "gini"
>> impurity_root<- gini(prop.table(table(y)))
>> impurity_l<- gini(prop.table(table(obs_0)))
>> impurity_R<-gini(prop.table(table(obs_1)))
>> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757
>>
>>
>> Thanks upfront,
>> Tal
>>
>>
>> ----------------Contact
>> Details:-------------------------------------------------------
>> Contact me:tal.gal...@gmail.com  |  972-52-7275845
>> Read me:www.talgalili.com  (Hebrew) |www.biostatistics.co.il  (Hebrew) |
>> www.r-statistics.com  (English)
>>
>> ----------------------------------------------------------------------------------------------
>>
>
> --
> *** Note new email address ***
> Ed Merkle, PhD
> Assistant Professor
> Department of Psychological Sciences (starting August 2011)
> University of Missouri
> Columbia, MO, USA 65211
>

        [[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Re: [R] In rpart, how is "improve" calculated? (in the "class" case)

Reply via email to