On 2/9/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
I freely admit that i'm totally lost on most of what you're suggesting ... it seems like you're suggesting that organizing the terms in a facet field into a tree structure would help us know which terms to compute the counts for first for a given query -- but it's not clear to me why that would be the case -- the terms don't necessarily have any correlation.
They don't necessarily have to be correlated (I think)... we propagate information up the tree in terms of max_df, and in terms of the union of all the children. There may be other info we can propagate as well... but I'm not sure what yet. We need a set-theory mathematician or something :-) It would be best, of course, to group the lowest-level nodes by the amount they have in common (the correlation you were looking for), but that sounds prohibitively expensive. It would *not* be expensive, however, to group the lowest-level nodes by max_df or union_size.
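To make that concrete, a node in such a tree might look roughly like this (a hypothetical sketch, not the code in the JIRA issue; java.util.BitSet stands in for whatever doc-set representation is actually used):

import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Hypothetical facet-tree node: leaves are individual terms; internal
// nodes carry the two aggregates propagated up from their children.
class FacetTreeNode {
    final List<FacetTreeNode> children;
    int termId = -1;   // meaningful only at a leaf
    int maxDf;         // largest df of any term in this subtree
    BitSet union;      // union of the doc sets of all terms in this subtree

    // Leaf: one term, its df, and its doc set.
    FacetTreeNode(int termId, int df, BitSet docs) {
        this.children = new ArrayList<FacetTreeNode>();
        this.termId = termId;
        this.maxDf = df;
        this.union = docs;
    }

    // Internal node: aggregate max_df and the union from the children.
    FacetTreeNode(List<FacetTreeNode> children) {
        this.children = children;
        this.union = new BitSet();
        for (FacetTreeNode c : children) {
            maxDf = Math.max(maxDf, c.maxDf);
            union.or(c.union);
        }
    }

    boolean isLeaf() { return children.isEmpty(); }
}

Grouping leaves so siblings share documents would make the union bound tighter (that's the expensive grouping), while grouping by max_df or union_size is just a sort over the leaves.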
: pruning by df is going to be one of the most important features
: here... so a variation (or optimization) would be to keep a list of
: the highest terms by df, and then build the facet tree excluding those
: top terms. That should lower the dfs in the tree nodes and allow more
: pruning.

i'm not sure why excluding high DF terms helps your approach, but one of the optimizations i anticipated when we first added term faceting was to build a cache of the high DF terms and test them first -- if you only want the 20 facet terms with the highest counts, and after computing counts for the 100 highest DF terms you find your lowest count (#20) to be 678, and the DF of the 100th highest DF term in your cache was 677, then you are guaranteed you don't need to check any other terms (which by definition have lower DFs) ... so if extracting the high DF terms helps whatever complex tree walking you are thinking of, then checking those high DF terms first might save you the hassle of walking the tree at all.
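in code, that early-out might look something like this (hypothetical names; it just restates the numbers above, where 678 is the lowest count in the top 20 and 677 is the smallest df in the cache):

// After computing exact counts for the cached high-DF terms, we can stop
// entirely if no uncached term could possibly displace the current top K:
// every uncached term has df <= the smallest df in the cache, and a term's
// facet count can never exceed its df.
static boolean doneAfterCache(int[] cachedDfsDescending, int lowestCountInTopK) {
    int maxPossibleUncachedCount =
        cachedDfsDescending[cachedDfsDescending.length - 1];
    return maxPossibleUncachedCount < lowestCountInTopK;
}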
Yes, that's a required part of it. If you exclude both the high df terms from the tree and the "bits" they contribute, then it becomes mandatory to calculate the intersections for those high df terms. It will also hopefully act as a good bootstrap to raise the minimum count in the queue and allow you to prune earlier in the tree based on the node's max_df or intersectionSize(union, baseDocSet).

The tree-building code I put in the JIRA bug already extracts the top terms and excludes them from the tree. As you note, some of the effectiveness will depend on the distribution of the terms. If everything is flat, you don't get to prune much, but hopefully at least at the bottom of the tree we can avoid checking every bucket of terms by checking the intersection count of the union.

I really don't know how well it will work or scale, but I do think it will do better than what we have now for high-cardinality facets, so I intend to find out.
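A pruned walk along those lines might look roughly like this (again a hypothetical sketch building on the FacetTreeNode above, not the code in the JIRA issue; the queue holds {termId, count} pairs ordered by count, smallest first):

import java.util.BitSet;
import java.util.Comparator;
import java.util.PriorityQueue;

class FacetTreeWalker {
    // queue would be created as, e.g.,
    //   new PriorityQueue<int[]>(k, Comparator.comparingInt((int[] a) -> a[1]));
    // and bootstrapped with the counts of the excluded high-df terms.
    static void walk(FacetTreeNode node, BitSet baseDocSet,
                     PriorityQueue<int[]> queue, int k) {
        int minCount = queue.size() < k ? 0 : queue.peek()[1];

        // Prune on max_df: no term below here can have a count above its df.
        if (node.maxDf <= minCount) return;

        // Prune on the union: intersecting the subtree's union with the
        // base doc set bounds the count of every individual term below.
        BitSet inter = (BitSet) node.union.clone();
        inter.and(baseDocSet);
        int bound = inter.cardinality();
        if (bound <= minCount) return;

        if (node.isLeaf()) {
            // A leaf's union is just the term's doc set, so "bound" is the
            // term's exact count, and it already beat minCount above.
            if (queue.size() == k) queue.poll();
            queue.add(new int[] { node.termId, bound });
        } else {
            for (FacetTreeNode c : node.children) {
                walk(c, baseDocSet, queue, k);
            }
        }
    }
}

The bootstrap matters because minCount starts at something realistic instead of 0, so the max_df test can reject whole subtrees on the very first pass.

-Yonik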