On 2/9/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
If you exclude both the high-df counts from the tree and the "bits" they
contribute, then it becomes mandatory to calculate the intersections for
those high-df terms.  It should also act as a good bootstrap, raising the
min_df of the queue and allowing you to prune earlier in the tree based on
a node's max_df or intersectionSize(union, baseDocSet).
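
A minimal sketch of that pruning bound (hypothetical names - this isn't the actual Solr code): a bounded min-heap of the top facet counts, where the smallest retained count is the bar a tree node's max_df must clear before the node is worth descending into. Seeding the queue with the high-df terms raises that bar early.

```java
import java.util.PriorityQueue;

// Hypothetical sketch: bounded queue of facet counts whose smallest
// entry supplies the pruning bound for the tree traversal.
public class FacetQueueSketch {
    static class Entry {
        final String term;
        final int count;
        Entry(String term, int count) { this.term = term; this.count = count; }
    }

    // Min-heap: the smallest surviving count sits at the top.
    private final PriorityQueue<Entry> queue =
        new PriorityQueue<>((a, b) -> Integer.compare(a.count, b.count));
    private final int limit;

    FacetQueueSketch(int limit) { this.limit = limit; }

    // Offer a term with its (pre-computed) intersection count.
    void offer(String term, int intersectionCount) {
        if (queue.size() < limit) {
            queue.add(new Entry(term, intersectionCount));
        } else if (intersectionCount > queue.peek().count) {
            queue.poll();
            queue.add(new Entry(term, intersectionCount));
        }
    }

    // Current lower bound (the "min_df of the queue"): 0 until full.
    int minCountBound() {
        return queue.size() < limit ? 0 : queue.peek().count;
    }

    // A node whose max possible count can't beat the bound is skippable.
    boolean canPrune(int nodeMaxDf) {
        return nodeMaxDf <= minCountBound();
    }
}
```

Once the queue is full, every new high count tightens the bound, so whole subtrees whose max_df falls under it never get their intersections computed.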

It occurs to me that iterating over all the high-df terms and getting
their intersection counts could also be made more efficient... (taking
intersection counts with many top terms, say 1000, will take up a lot of
time right there before you even get to the tree).
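
To illustrate the per-term cost (an assumed representation, using java.util.BitSet rather than Lucene's own doc-set classes): each high-df term needs one AND plus a population count against the base doc set, so 1000 top terms means 1000 passes over index-sized bitsets up front.

```java
import java.util.BitSet;

// Hypothetical illustration of one intersection count: the number of
// documents matching both the term and the base query.
public class IntersectionCount {
    static int intersectionSize(BitSet termDocs, BitSet baseDocSet) {
        BitSet tmp = (BitSet) termDocs.clone(); // copy so a cached term set survives
        tmp.and(baseDocSet);
        return tmp.cardinality();
    }
}
```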

The high-df terms could also be put into a facet tree.  Perhaps they
could even be added as nodes to the same facet tree - it depends on the
traversal algorithm (we may want the high-df terms checked first, since
those are more efficient to access: a specific term lets you call
numDocs() directly, rather than being contained in a block of terms).
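
One way that "check the high-df terms first" ordering could look (a sketch with made-up node fields, not the planned implementation): single-term nodes, which admit a cheap exact count, are visited before the blocks of terms, with ties broken by descending max_df so the pruning bound rises as fast as possible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical mixed tree level: a node is either a single high-df term
// (termCount == 1, cheap exact lookup) or a block of low-df terms.
public class TraversalOrder {
    static class Node {
        final String label;
        final int termCount;   // 1 for a single high-df term
        final int maxDf;       // upper bound on any count in this node
        Node(String label, int termCount, int maxDf) {
            this.label = label; this.termCount = termCount; this.maxDf = maxDf;
        }
    }

    // Single-term nodes first, then by descending maxDf.
    static List<Node> visitOrder(List<Node> nodes) {
        List<Node> out = new ArrayList<>(nodes);
        out.sort(Comparator
            .comparingInt((Node n) -> n.termCount == 1 ? 0 : 1)
            .thenComparing(Comparator.comparingInt((Node n) -> n.maxDf).reversed()));
        return out;
    }
}
```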

I probably won't implement it that way first... I want to see what
improvement we can get with a basic tree implementation.

-Yonik
