Max,

My disagreement was really just about the single statement "I suspect that >1M points are pretty densely packed into 40-dimensional space" in your original post. On the larger issue of diminishing returns with the size of a training set, I agree with your points below.
Rich

> -----Original Message-----
> From: Max Kuhn [mailto:mxk...@gmail.com]
> Sent: Friday, June 18, 2010 1:35 PM
> To: Bert Gunter
> Cc: Raubertas, Richard; Matthew OKane; r-help@r-project.org
> Subject: Re: [R] Cforest and Random Forest memory use
>
> Rich's calculations are correct, but from a practical standpoint I think that using all the data for the model is overkill for a few reasons:
>
> - The calculations that you show implicitly assume that the predictor values can be reliably differentiated from each other. Unless they are deterministic calculations (e.g. number of hydrogen bonds, % GC in a sequence), there is measurement error. We don't know anything about the context here, but in the lab sciences the measurement variation can make the *effective* number of predictor values much less than n. So you can have millions of predictor values but you might only be able to differentiate k <<<< n values reliably.
>
> - The important dimensionality to consider is based on how many of those 40 predictors are relevant to the outcome. Again, we don't know the context of the data, but there is a strong prior towards the number of important variables being less than 40.
>
> - We've had to consider these types of problems a lot. We might have 200K samples (compounds in this case) and 1000 predictors that appear to matter. Ensembles of trees tended to do very well, as did kernel methods. In either of those two classes of models, the prediction time for a single new observation is very long. So we looked at how performance was affected if we reduced the training set size. In essence, we found that <50% of the data could be used with no appreciable effect on performance. We could make the percentage smaller if we used the predictor values to choose which samples go into the training set: if we had m samples in the training set, the next sample added would have to have maximum dissimilarity to the existing m samples.
>
> - If you are going to do any feature selection, you would be better off segregating a percentage of those million samples as a hold-out set to validate the selection process (a few people from Merck have written excellent papers on the selection bias problem). Similarly, if this is a classification problem, any ROC curve analysis is most effective when the cutoffs are derived from a separate hold-out data set. Just dumping all those samples into a training set seems like a lost opportunity.
>
> Again, these are not refutations of your calculations. I just think that there are plenty of non-theoretical arguments for not using all of those values for the training set.
>
> Thanks,
>
> Max
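The maximum-dissimilarity scheme Max describes is roughly what caret's maxDissim() implements. A minimal sketch, assuming a numeric predictor matrix; the object names and sizes are made up for illustration, and for a million rows this brute-force search would need something more efficient:

    library(caret)

    set.seed(1)
    pool  <- matrix(rnorm(10000 * 40), ncol = 40)          # hypothetical pool of candidate rows
    start <- pool[sample(nrow(pool), 5), , drop = FALSE]   # small starting training set

    ## repeatedly add the candidate that is most dissimilar to the rows
    ## already selected, until 100 new rows have been chosen
    new_rows <- maxDissim(start, pool, n = 100)
    train_x  <- rbind(start, pool[new_rows, ])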
> On Fri, Jun 18, 2010 at 11:41 AM, Bert Gunter <gunter.ber...@gene.com> wrote:
> > Rich is right, of course. One way to think about it is this (paraphrased from the section on the "Curse of Dimensionality" in Hastie et al.'s "The Elements of Statistical Learning"): suppose 10 uniformly distributed points on a line give what you consider to be adequate coverage of the line. Then in 40 dimensions you'd need 10^40 uniformly distributed points to give equivalent coverage.
> >
> > Various other aspects of the curse of dimensionality are discussed in the book, one of which is that in high dimensions most points are closer to the boundaries than to each other. As Rich indicates, this has profound implications for what one can sensibly do with such data. One example is: nearest-neighbor procedures don't make much sense (as nobody is likely to have anybody else nearby), which Rich's little simulation nicely demonstrated.
> >
> > Cheers to all,
> >
> > Bert Gunter
> > Genentech Nonclinical Statistics
> >
> > -----Original Message-----
> > From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Raubertas, Richard
> > Sent: Thursday, June 17, 2010 4:15 PM
> > To: Max Kuhn; Matthew OKane
> > Cc: r-help@r-project.org
> > Subject: Re: [R] Cforest and Random Forest memory use
> >
> >> -----Original Message-----
> >> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Max Kuhn
> >> Sent: Monday, June 14, 2010 10:19 AM
> >> To: Matthew OKane
> >> Cc: r-help@r-project.org
> >> Subject: Re: [R] Cforest and Random Forest memory use
> >>
> >> The first thing that I would recommend is to avoid the "formula interface" to models. The internals that R uses to create matrices from a formula + data set are not efficient. If you had a large number of variables, I would have automatically pointed to that as a source of issues. cforest and ctree only have formula interfaces, though, so you are stuck on that one. The randomForest package has both interfaces, so that might be better.
> >>
> >> Probably the issue is the depth of the trees. With that many observations, you are likely to get extremely deep trees. You might try limiting the depth of the trees and see if that has an effect on performance.
> >>
> >> We run into these issues with large compound libraries; in those cases we do whatever we can to avoid ensembles of trees or kernel methods. If you want those, you might need to write your own code that is hyper-efficient and tuned to your particular data structure (as we did).
> >>
> >> On another note... are this many observations really needed? You have 40ish variables; I suspect that >1M points are pretty densely packed into 40-dimensional space.
> >
> > This did not seem right to me: 40-dimensional space is very, very big, and even a million observations will be thinly spread. There is probably some analytic result from the theory of coverage processes about this, but I just did a quick simulation. If a million samples are independently and randomly distributed in a 40-d unit hypercube, then >90% of the points in the hypercube will be more than one-quarter of the maximum possible distance (sqrt(40)) from the nearest sample. And about 40% of the hypercube will be more than one-third of the maximum possible distance from the nearest sample. So the samples do not densely cover the space at all.
> >
> > One implication is that modeling the relation of a response to 40 predictors will inevitably require a lot of smoothing, even with a million data points.
> >
> > Richard Raubertas
> > Merck & Co.
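A sketch of the kind of simulation Richard describes is below. The sizes are scaled down from his 10^6 samples so it runs in a reasonable time in plain R; the exact fractions will therefore differ from his figures, but the qualitative point, that most of the hypercube is far from every sample, comes out the same:

    ## points scattered uniformly in the 40-d unit hypercube, plus random
    ## probe locations at which we measure the distance to the nearest point
    set.seed(42)
    d        <- 40
    n_sample <- 1e5              # scaled down from the 1e6 in Richard's run
    n_probe  <- 1000
    max_dist <- sqrt(d)          # longest possible distance in the unit cube

    samples <- matrix(runif(n_sample * d), ncol = d)
    probes  <- matrix(runif(n_probe * d), ncol = d)

    tS <- t(samples)             # d x n_sample; transpose once, not per probe
    nearest <- apply(probes, 1, function(p) sqrt(min(colSums((tS - p)^2))))

    mean(nearest > max_dist / 4) # share of the cube > 1/4 of the max distance from any sample
    mean(nearest > max_dist / 3) # share > 1/3 of the max distance from any sample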
> >> Do you lose much by sampling the data set or allocating a large portion to a test set? If you have thousands of predictors, I could see the need for so many observations, but I'm wondering if many of the samples are redundant.
> >>
> >> Max
> >>
> >> On Mon, Jun 14, 2010 at 3:45 AM, Matthew OKane <mlok...@gmail.com> wrote:
> >> > Answers added below.
> >> > Thanks again,
> >> > Matt
> >> >
> >> > On 11 June 2010 14:28, Max Kuhn <mxk...@gmail.com> wrote:
> >> >>
> >> >> Also, you have not said:
> >> >>
> >> >> - your OS: Windows Server 2003 64-bit
> >> >> - your version of R: 2.11.1 64-bit
> >> >> - your version of party: 0.9-9995
> >> >> - your code:
> >> >
> >> >   test.cf <- cforest(badflag ~ ., data = example,
> >> >                      control = cforest_control(teststat = 'max',
> >> >                                                testtype = 'Teststatistic',
> >> >                                                replace = FALSE, ntree = 500,
> >> >                                                savesplitstats = FALSE, mtry = 10))
> >> >
> >> >> - what "large data set" means: > 1 million observations, 40+ variables, around 200MB
> >> >> - what "very large model objects" means: anything which breaks
> >> >>
> >> >> So... how is anyone supposed to help you?
> >> >>
> >> >> Max
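Given Max's earlier suggestions (skip the formula interface and keep the trees from growing too deep), a sketch of how the same model could be set up with randomForest is below. The data frame `example` and response column `badflag` are taken from Matthew's code above; the particular values for sampsize, nodesize, and maxnodes are illustrative guesses, not recommendations:

    library(randomForest)

    ## matrix/data-frame interface avoids the formula machinery entirely
    x <- example[, setdiff(names(example), "badflag")]
    y <- factor(example$badflag)

    fit <- randomForest(
      x, y,
      ntree    = 500,
      mtry     = 10,
      replace  = FALSE,
      sampsize = floor(0.25 * nrow(x)),  # each tree is grown on a subset of rows
      nodesize = 50,                     # larger terminal nodes give shallower trees
      maxnodes = 4096                    # hard cap on the number of terminal nodes
    )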