On Thu, 28 Jul 2011, seanstcl...@verizon.net wrote:
I am running the ctree function in R. My data has about 10 variables, many of which are categorical. Two of the categorical variables have many levels (one has 900 levels, the other has 1,000 levels). As an example, one of these variables is a disease code and is structured as A, B, C, ..., AA, AB, AC, ... Each time I have tried to run the ctree function with these two variables included in the data, the function never stops running. When I remove these two variables from the data and run without them, the function returns in about 3 seconds. Q: Is there a limit to the number of levels that a categorical variable can contain? Is there something else that I may be overlooking?
ctree() tries to split such a variable into two groups, one for the left and one for the right daughter node. There are 2^(k-1) - 1 possible groupings for a categorical variable with k levels. For k = 1000 this is about 5.4e300 candidate splits, which is finite but far too many to evaluate in any practical amount of time.
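To get a feeling for how quickly this grows, the two formulas can simply be evaluated for a few values of k. This is only a sketch of the arithmetic; the helper names n_unordered_splits() and n_ordered_splits() are made up for illustration and are not part of any package.

```r
# Number of ways to divide k unordered levels into two non-empty groups
# (hypothetical helper, not part of party or base R)
n_unordered_splits <- function(k) 2^(k - 1) - 1

# Number of cut points between adjacent levels when the k levels are ordered
# (hypothetical helper as well)
n_ordered_splits <- function(k) k - 1

k <- c(3, 5, 10, 20, 50, 1000)
splits <- data.frame(
  k         = k,
  unordered = n_unordered_splits(k),
  ordered   = n_ordered_splits(k)
)
print(splits)
```

Already at k = 50 the unordered count is in the hundreds of trillions, while the ordered count stays at 49.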
You can try to collapse the levels into a coarser classification that is still computable. Or, if the categorical variable is in fact ordered, declare it as an ordered factor; then only k-1 splits are possible, which is small enough.
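Both suggestions can be sketched in base R. The data below are simulated, the column names (disease, severity) are invented, and lump_levels() is a hypothetical helper that keeps the most frequent levels and pools the rest. Whether pooling by frequency is sensible for real disease codes is an assumption; a grouping based on subject knowledge may be more appropriate.

```r
set.seed(1)

# Simulated disease codes of one or two capital letters (hypothetical data)
codes <- c(LETTERS, as.vector(outer(LETTERS, LETTERS, paste0)))
dat <- data.frame(disease = factor(sample(codes, 5000, replace = TRUE)))

# Option 1: coarsen by keeping the n_keep most frequent levels and
# pooling all remaining levels into a single "other" level.
# lump_levels() is an illustrative helper, not an existing function.
lump_levels <- function(f, n_keep = 10, other = "other") {
  f <- as.factor(f)
  counts <- sort(table(f), decreasing = TRUE)
  keep <- names(counts)[seq_len(min(n_keep, length(counts)))]
  x <- as.character(f)
  x[!is.na(x) & !(x %in% keep)] <- other  # missing values stay missing
  factor(x)
}

dat$disease_coarse <- lump_levels(dat$disease, n_keep = 10)
nlevels(dat$disease_coarse)  # at most n_keep + 1 levels

# Option 2: if the levels have a natural order, declare it explicitly
dat$severity <- factor(
  sample(c("low", "medium", "high"), 5000, replace = TRUE),
  levels  = c("low", "medium", "high"),
  ordered = TRUE
)
is.ordered(dat$severity)
```

The coarsened or ordered column would then be used in the model formula in place of the original factor.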
hth, Z
Thanks. ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.