On 2/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
If you exclude both the high-df counts from the tree and the "bits" they contribute, then it becomes mandatory to calculate the intersections for those high-df terms. That should also act as a good bootstrap: it raises the min_df of the queue up front, letting you prune earlier in the tree based on a node's max_df or on intersectionSize(union, baseDocSet).
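A minimal sketch of that bootstrap idea, assuming a bounded min-heap holds the top-k facet counts (all names here are hypothetical, not actual Solr APIs): seeding the heap with the high-df terms' exact intersection counts first means the pruning threshold (the smallest count in the queue) is already high before the term tree is ever traversed.

```java
import java.util.PriorityQueue;

/** Hypothetical bounded min-heap of (termId, count) pairs keeping the k highest counts. */
public class TopKFacetQueue {
    private final int k;
    // smallest count sits at the head, so it can be evicted cheaply
    private final PriorityQueue<int[]> heap =
        new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));

    public TopKFacetQueue(int k) { this.k = k; }

    /** Offer a candidate; returns true if it entered the top-k. */
    public boolean offer(int termId, int count) {
        if (heap.size() < k) { heap.add(new int[]{termId, count}); return true; }
        if (count > heap.peek()[1]) {
            heap.poll();
            heap.add(new int[]{termId, count});
            return true;
        }
        return false;
    }

    /** Current min_df threshold: a tree node whose maximum possible
     *  intersection count is <= this value can be skipped entirely. */
    public int minCount() { return heap.size() < k ? 0 : heap.peek()[1]; }

    public static void main(String[] args) {
        TopKFacetQueue q = new TopKFacetQueue(2);
        // bootstrap: exact intersection counts for the high-df terms first
        q.offer(1, 500);
        q.offer(2, 300);
        System.out.println(q.minCount()); // threshold is already 300 pre-traversal
        // a term (or whole subtree) bounded at <= 300 is now prunable
        System.out.println(q.offer(3, 100)); // rejected
    }
}
```

The point of seeding with high-df terms is just that their counts are likely to be large, so the threshold climbs fast and more of the tree can be skipped.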
It occurs to me that iterating over all the high-df terms and getting their intersection counts could also be more efficient in its own right... if you take intersection counts with many top terms, say 1000, that eats a lot of time before you even get to the tree.

The high-df terms could also be put into a facet tree of their own. Perhaps they could even be added as nodes in the same facet tree - it depends on the traversal algorithm. (We may want the high-df terms checked first, since they would be more efficient to access: you can seek to a specific term and call numDocs(), rather than having the term buried in a block of terms.)

I probably won't implement it that way first... I want to see what improvement we can get with a basic tree implementation.

-Yonik
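The node-level pruning test mentioned above can be sketched roughly as follows, using java.util.BitSet as a stand-in for Lucene's doc-id sets (the method names are illustrative, not from the actual codebase): the union of a node's term bitsets intersected with the base doc set is an upper bound on the count of any single term under that node, so if the bound can't beat the queue minimum, the whole subtree can be skipped.

```java
import java.util.BitSet;

// Illustrative sketch only; BitSet substitutes for Lucene/Solr doc-id sets.
public class FacetNodePruning {

    /** Upper bound on the facet count of ANY term under a tree node:
     *  |(union of the node's term bitsets) AND (base query doc set)|. */
    static int intersectionBound(BitSet nodeUnion, BitSet baseDocSet) {
        BitSet tmp = (BitSet) nodeUnion.clone(); // and() is destructive, so copy first
        tmp.and(baseDocSet);
        return tmp.cardinality();
    }

    /** Skip the whole subtree if even the union can't beat the queue's min_df. */
    static boolean prunable(BitSet nodeUnion, BitSet baseDocSet, int minDfInQueue) {
        return intersectionBound(nodeUnion, baseDocSet) <= minDfInQueue;
    }

    public static void main(String[] args) {
        BitSet union = new BitSet();
        union.set(0, 10);              // node's terms collectively match docs 0..9
        BitSet base = new BitSet();
        base.set(5, 8);                // base query matches docs 5..7
        System.out.println(intersectionBound(union, base)); // bound = 3
        System.out.println(prunable(union, base, 3));       // true: 3 <= min_df
    }
}
```

This is also why bootstrapping min_df with the high-df terms matters: the higher the threshold, the more often this test fires and the less of the tree gets visited.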