Hi Ed, Thank you for the reply! Professor Atkinson already gave me that answer by pointing me to the technical report of rpart that describes this: *http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf*<http://mayoresearch.mayo.edu/mayo/research/biostat/upload/61.pdf>
However, I was also only able to reproduce the "gini" impurity, and not the "information" one. I hope either Professor Atkinson or some other member of the list could help out with this. In the meantime, I also found a bug in the code I sent to the mailing list, bellow is the fixed code (also more organized): # creating data set.seed(1324) y <- sample(c(0,1), 20, T) x <- y x[1:5] <- 0 # manually making the first split obs_L <- y[x<.5] obs_R <- y[x>.5] n_L <- sum(x<.5) n_R <- sum(x>.5) n <- length(x) calc.impurity <- function(func = gini) { impurity_root <- func(prop.table(table(y))) impurity_L <- func(prop.table(table(obs_L))) impurity_R <-func(prop.table(table(obs_R))) imp <- impurity_root - ((n_L/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757 imp*n } # for "gini" require(rpart) fit <- rpart(y~x, method = "class", parms=list(split='gini')) fit$split[,3] # 5.384615 gini <- function(p) {sum(p*(1-p))} calc.impurity(gini) # 5.384615 # success! # for "information" I fail... fit <- rpart(y~x, method = "class", parms=list(split='information')) fit$split[,3] # why is improve here 6.84029 ? entropy <- function(p) { if(any(p==1)) return(0) # works for the case when y has only 0 and 1 categories... -sum(p*log(p)) } calc.impurity(entropy) # 9.247559 != 6.84029 ----------------Contact Details:------------------------------------------------------- Contact me: tal.gal...@gmail.com | 972-52-7275845 Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) | www.r-statistics.com (English) ---------------------------------------------------------------------------------------------- On Tue, Jun 14, 2011 at 10:56 PM, Ed Merkle <merk...@missouri.edu> wrote: > Tal, > > For the Gini criterion, the "improve" value can be calculated as a weighted > sum of the improvement in impurity. Continuing with your original code: > > > # for "gini" > impurity_root<- gini(prop.table(table(y))) > impurity_l<- gini(prop.table(table(obs_0))) > impurity_R<-gini(prop.table(table(obs_1))) > > # (13 and 7 are sample sizes in respective nodes) > 13*(impurity_root - impurity_l) + 7*(impurity_root - impurity_R) > [1] 5.384615 > > This does not appear to extend immediately to the information criterion, > however. I'm not sure about the 6.84. > > Ed > > > On 6/14/11 5:00 AM, r-help-requ...@r-project.org wrote: > >> ------------------------------ >> >> Message: 4 >> Date: Mon, 13 Jun 2011 15:47:26 +0300 >> From: Tal Galili<tal.gal...@gmail.com> >> To:r-help@r-project.org >> Subject: [R] In rpart, how is "improve" calculated? (in the "class" >> case) >> Message-ID:<BANLkTimp1aFQoYrKina7H0Rnk=0zkr_...@mail.gmail.com> >> Content-Type: text/plain >> >> >> Hi all, >> >> I apologies in advance if I am missing something very simple here, but >> since >> I failed at resolving this myself, I'm sending this question to the list. >> >> I would appreciate any help in understanding how the rpart function is >> (exactly) computing the "improve" (which is given in fit$split), and how >> it >> differs when using the split='information' vs split='gini' parameters. >> >> According to the help in rpart.object: >> "improve, which is the improvement in deviance given by this split" >> >>> From what I understand, that would mean that the "improve" value should >>> not >>> >> be different when using different "split" switches. Since it is >> different, >> then I suspect that it is reflecting the impurity measure somehow, but I >> can't seem to understand how exactly. >> >> Bellow is some simple R code showing the result for a simple >> classification >> tree, with what the function outputs, and what I would have expected to >> see >> if "improve" were to simply reflect the change in impurity. >> >> >> set.seed(1324) >> y<- sample(c(0,1), 20, T) >> x<- y >> x[1:5]<- 0 >> require(rpart) >> fit<- rpart(y~x, method = "class", parms=list(split='information')) >> fit$split[,3] # why is improve here 6.84 ? >> fit<- rpart(y~x, method = "class", parms=list(split='gini')) >> fit$split[,3] # why is improve here 5.38 ? >> >> >> # Here is what I thought it should have been: >> # for "information" >> entropy<- function(p) { >> if(any(p==1)) return(0) # works for the case when y has only 0 and 1 >> categories... >> -sum(p*log(p,2)) >> } >> gini<- function(p) {sum(p*(1-p))} >> >> obs_1<- y[x>.5] >> obs_0<- y[x<.5] >> n_l<- sum(x>.5) >> n_R<- sum(x<.5) >> n<- length(x) >> >> # for entropy (information) >> impurity_root<- entropy(prop.table(table(y))) >> impurity_l<- entropy(prop.table(table(obs_0))) >> impurity_R<-entropy(prop.table(table(obs_1))) >> # shouldn't this have been "improve" ?? >> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.7272 >> >> # for "gini" >> impurity_root<- gini(prop.table(table(y))) >> impurity_l<- gini(prop.table(table(obs_0))) >> impurity_R<-gini(prop.table(table(obs_1))) >> impurity_root - ((n_l/n)*impurity_l + (n_R/n)*impurity_R) # 0.3757 >> >> >> Thanks upfront, >> Tal >> >> >> ----------------Contact >> Details:------------------------------------------------------- >> Contact me:tal.gal...@gmail.com | 972-52-7275845 >> Read me:www.talgalili.com (Hebrew) |www.biostatistics.co.il (Hebrew) | >> www.r-statistics.com (English) >> >> ---------------------------------------------------------------------------------------------- >> > > -- > *** Note new email address *** > Ed Merkle, PhD > Assistant Professor > Department of Psychological Sciences (starting August 2011) > University of Missouri > Columbia, MO, USA 65211 > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.