The first thing that I would recommend is to avoid the "formula interface" to models. The internals that R uses to create matrices from a formula + data set are not efficient. If you had a large number of variables, I would have automatically pointed to that as a source of issues. cforest and ctree only have formula interfaces though, so you are stuck on that one. The randomForest package has both interfaces, so that might be better.
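
As a rough sketch of what the non-formula interface looks like in randomForest (the data here are simulated stand-ins, not your data; only the names "example" and "badflag" come from your code, and ntree and mtry are set small just so it runs quickly):

```r
library(randomForest)

set.seed(1)

# Simulated stand-in for the real data set; the predictors are made up
n <- 2000
example <- data.frame(matrix(rnorm(n * 10), ncol = 10))
example$badflag <- factor(ifelse(example$X1 + rnorm(n) > 0, "bad", "good"))

# Split predictors and outcome by hand instead of writing badflag ~ .
x <- example[, setdiff(names(example), "badflag")]
y <- example$badflag

# x/y interface: the predictors are passed directly, without a formula
fit <- randomForest(x = x, y = y, ntree = 50, mtry = 3)
print(fit)
```

With the x/y call, randomForest gets the predictors directly rather than going through the formula machinery. How much that saves on a data set of your size is something you would have to measure.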

The issue is probably the depth of the trees. With that many observations, you are likely to get extremely deep trees. You might try limiting the depth of the tree and see if that has an effect on performance.

We run into these issues with large compound libraries; in those cases we do whatever we can to avoid ensembles of trees or kernel methods. If you want those, you might need to write your own code that is hyper-efficient and tuned to your particular data structure (as we did).

On another note... are this many observations really needed? You have 40ish variables; I suspect that >1M points are pretty densely packed into 40-dimensional space. Do you lose much by sampling the data set or allocating a large portion to a test set? If you had thousands of predictors, I could see the need for so many observations, but I'm wondering if many of the samples are redundant.

Max

On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane <mlok...@gmail.com> wrote:
> Answers added below.
> Thanks again,
> Matt
>
> On 11 June 2010 14:28, Max Kuhn <mxk...@gmail.com> wrote:
>>
>> Also, you have not said:
>>
>>  - your OS: Windows Server 2003 64-bit
>>  - your version of R: 2.11.1 64-bit
>>  - your version of party: 0.9-9995
>>  - your code: test.cf <-(formula=badflag~.,data =
>> example,control=cforest_control
> (teststat = 'max', testtype = 'Teststatistic', replace = FALSE,
> ntree = 500, savesplitstats = FALSE,mtry = 10))
>>  - what "Large data set" means: > 1 million observations, 40+ variables,
>> around 200MB
>>  - what "very large model objects" means - anything which breaks
>>
>> So... how is anyone suppose to help you?
>>
>> Max
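
P.S. To make the tree-size and sampling suggestions concrete, here is a minimal sketch using randomForest rather than party. The data are simulated and all of the settings are arbitrary illustrations, not recommendations. randomForest has no direct depth argument, so this uses maxnodes (a cap on terminal nodes per tree), nodesize and sampsize as stand-ins for "limit the depth".

```r
library(randomForest)

set.seed(2)

# Simulated stand-in for the real data set; the predictors are made up
n <- 5000
example <- data.frame(matrix(rnorm(n * 10), ncol = 10))
example$badflag <- factor(ifelse(example$X1 + rnorm(n) > 0, "bad", "good"))

# Train on a sample and keep the large remainder as a test set
in_train <- sample(n, size = 1000)
train <- example[in_train, ]
test  <- example[-in_train, ]
x_train <- train[, names(train) != "badflag"]
x_test  <- test[, names(test) != "badflag"]

# Constrained forest: smaller trees, each grown on a subsample of rows
fit_small <- randomForest(x = x_train, y = train$badflag,
                          ntree = 50,
                          maxnodes = 16,   # at most 16 terminal nodes per tree
                          nodesize = 25,   # minimum size of terminal nodes
                          sampsize = 500)  # rows drawn for each tree

# Default forest on the same data for comparison
fit_full <- randomForest(x = x_train, y = train$badflag, ntree = 50)

# Compare the size of the fitted objects and the held-out accuracy
print(object.size(fit_small), units = "Kb")
print(object.size(fit_full), units = "Kb")
acc_small <- mean(predict(fit_small, x_test) == test$badflag)
acc_full  <- mean(predict(fit_full, x_test) == test$badflag)
print(c(small = acc_small, full = acc_full))
```

If the constrained fit gives up little accuracy on the held-out rows, that would suggest the extra depth and the extra observations are not buying you much.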