The direction vector is parallel to goodness: for the split scored by goodness[i], direction[i] = -1 means that values less than that split's cutpoint go left and values greater than it go right. If direction[i] = 1, the two sides are swapped.
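As a rough illustration of this convention, here is a small base-R sketch. The helper assign_sides is hypothetical (it is not part of rpart), and taking the cutpoint as the midpoint between neighbouring x values is an assumption, chosen because it agrees with the 9.6835 printed in the quoted output below.

```r
# Hypothetical helper, not an rpart function: report which side each
# observation goes to for one candidate split of a sorted predictor.
#   x          sorted predictor values
#   i          split index: the cut separates obs 1:i from (i+1):n
#   direction  -1 = values below the cutpoint go left, 1 = they go right
assign_sides <- function(x, i, direction) {
  cutpoint <- (x[i] + x[i + 1]) / 2      # assumed: midpoint between neighbours
  below <- x < cutpoint
  # direction == 1 flips the usual assignment
  ifelse(xor(below, direction == 1), "left", "right")
}

x <- c(0.428, 0.876, 1.467, 1.492)
assign_sides(x, i = 2, direction = -1)   # "left"  "left"  "right" "right"
assign_sides(x, i = 2, direction = 1)    # "right" "right" "left"  "left"
```

Either value of direction describes the same partition of the data; only the left/right labelling of the two child nodes changes.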
The long and short of it is that, for most trees, we want to send values smaller than the split value left and values greater than it right, so direction should be -1 for every split, i.e., direction = rep(-1, length(goodness)). The vector is only there in case you want to customize the structure of your tree.

Hope that helps,
Sam

On Jan 3, 2007 12:56 PM, Paolo Radaelli <[EMAIL PROTECTED]> wrote:
> Dear all,
> I'm trying to work with user-defined split functions in rpart
> (file rpart\tests\usersplits.R in
> http://cran.r-project.org/src/contrib/rpart_3.1-34.tar.gz - see bottom of
> the email).
> Suppose we have the following data.frame (note that x's values are already
> sorted)
> > D
>    y     x
> 1  7 0.428
> 2  3 0.876
> 3  1 1.467
> 4  6 1.492
> 5  3 1.703
> 6  4 2.406
> 7  8 2.628
> 8  6 2.879
> 9  5 3.025
> 10 3 3.494
> 11 2 3.496
> 12 6 4.623
> 13 4 4.824
> 14 6 4.847
> 15 2 6.234
> 16 7 7.041
> 17 2 8.600
> 18 4 9.225
> 19 5 9.381
> 20 8 9.986
>
> Running rpart and setting minbucket=1 and maxdepth=1 we get the following
> tree (which uses, by default, deviance):
> > rpart(D$y~D$x,control=rpart.control(minbucket=1,maxdepth=1))
> n= 20
> node), split, n, deviance, yval
>       * denotes terminal node
> 1) root 20 84.80000 4.600000
>   2) D$x< 9.6835 19 72.63158 4.421053 *
>   3) D$x>=9.6835  1  0.00000 8.000000 *
>
> This means that the first 19 observations have been sent to the left side of
> the tree and one observation to the right.
> This agrees with what we observe in goodness (the maximum is the last
> element of the vector).
>
> The thing I really don't understand is the direction vector.
> # direction= -1 = send "y< cutpoint" to the left side of the tree
> #             1 = send "y< cutpoint" to the right
>
> What does it mean?
> In the example considered here we have
> > sign(lmean)
> [1]  1  1 -1 -1 -1 -1 -1  1  1  1 -1 -1 -1 -1 -1 -1 -1 -1 -1
>
> What criterion is used?
> In my opinion all the values should be equal to -1, given that they have
> to be sent to the left side of the tree.
> Can someone help me?
> Thank you
>
> #######################################################
> # The split function, where most of the work occurs.
> #   Called once per split variable per node.
> #   If continuous=T (the case considered here)
> #     The actual x variable is ordered
> #     y is supplied in the sort order of x, with no missings,
> #     return two vectors of length (n-1):
> #        goodness = goodness of the split, larger numbers are better.
> #                   0 = couldn't find any worthwhile split
> #          the ith value of goodness evaluates splitting obs 1:i vs (i+1):n
> #        direction= -1 = send "y< cutpoint" to the left side of the tree
> #                    1 = send "y< cutpoint" to the right
> #          this is not a big deal, but making larger "mean y's" move towards
> #          the right of the tree, as we do here, seems to make it easier to
> #          read
> #   If continuous=F, x is a set of integers defining the groups for an
> #     unordered predictor. In this case:
> #       direction = a vector of length m= "# groups". It asserts that the
> #         best split can be found by lining the groups up in this order
> #         and going from left to right, so that only m-1 splits need to
> #         be evaluated rather than 2^(m-1)
> #       goodness = m-1 values, as before.
> #
> #   The reason for returning a vector of goodness is that the C routine
> #     enforces the "minbucket" constraint. It selects the best return value
> #     that is not too close to an edge.
>
> The vector wt of weights in our case is:
> > wt
> [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
>
> temp2 <- function(y, wt, x, parms, continuous) {
>     # Center y
>     n <- length(y)
>     y <- y - sum(y*wt)/sum(wt)
>     if (continuous) {
>         # continuous x variable
>         temp <- cumsum(y*wt)[-n]
>         left.wt  <- cumsum(wt)[-n]
>         right.wt <- sum(wt) - left.wt
>         lmean <- temp/left.wt
>         rmean <- -temp/right.wt
>         goodness <- (left.wt*lmean^2 + right.wt*rmean^2)/sum(wt*y^2)
>         list(goodness= goodness, direction=sign(lmean))
>     }
> }
>
> Paolo Radaelli
> Dipartimento di Metodi Quantitativi per le Scienze Economiche ed Aziendali
> Facoltà di Economia
> Università degli Studi di Milano-Bicocca
> P.zza dell'Ateneo Nuovo, 1
> 20126 Milano
> Italy
> e-mail [EMAIL PROTECTED]
>
> ______________________________________________
> [EMAIL PROTECTED] mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
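
For anyone who wants to check the numbers in the quoted message, the following base-R sketch reruns the continuous branch of temp2 on the posted data. The names split_continuous, best and cutpoint are introduced here for illustration only, and the midpoint cutpoint is an assumption inferred from the printed value 9.6835. Under those assumptions, the winning split has direction -1, which is consistent with the 19 smallest x values being sent left.

```r
# Data from the quoted message (x is already sorted).
y <- c(7, 3, 1, 6, 3, 4, 8, 6, 5, 3, 2, 6, 4, 6, 2, 7, 2, 4, 5, 8)
x <- c(0.428, 0.876, 1.467, 1.492, 1.703, 2.406, 2.628, 2.879, 3.025, 3.494,
       3.496, 4.623, 4.824, 4.847, 6.234, 7.041, 8.600, 9.225, 9.381, 9.986)
wt <- rep(1, length(y))

# Continuous branch of temp2 from the quoted message; it additionally
# returns the centered left/right means so they can be inspected.
split_continuous <- function(y, wt) {
  n <- length(y)
  y <- y - sum(y * wt) / sum(wt)          # center y
  temp <- cumsum(y * wt)[-n]
  left.wt <- cumsum(wt)[-n]
  right.wt <- sum(wt) - left.wt
  lmean <- temp / left.wt
  rmean <- -temp / right.wt
  goodness <- (left.wt * lmean^2 + right.wt * rmean^2) / sum(wt * y^2)
  list(goodness = goodness, direction = sign(lmean),
       lmean = lmean, rmean = rmean)
}

res <- split_continuous(y, wt)
best <- which.max(res$goodness)           # index of the best split
cutpoint <- (x[best] + x[best + 1]) / 2   # assumed midpoint rule
print(c(best = best, cutpoint = cutpoint, direction = res$direction[best]))
print(c(left = res$lmean[best], right = res$rmean[best]))  # centered means
```

Since direction is sign(lmean), it is -1 exactly where the left group has the lower (centered) mean, which matches the comment in the quoted source about moving larger mean y's towards the right of the tree.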