On 2/9/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> I freely admit that i'm totally lost on most of what you're suggesting ...
> it seems like you're suggesting that organizing the terms in a facet field
> into a tree structure would help us know which terms to compute the
> counts for first for a given query -- but it's not clear to me why that
> would be the case -- the terms don't necessarily have any correlation.

They don't necessarily have to be correlated (I think)... we propagate
information up the tree in the form of max_df, and in the form of the
union of all the children's doc sets.
There may be other info we can propagate as well... but I'm not sure what yet.
We need a set-theory mathematician or something :-)

It would be best, of course, to group the lowest-level nodes by the
amount they have in common (the correlation you were looking for), but
that sounds prohibitively expensive.
It would *not* be expensive, however, to group the lowest-level nodes
by max_df or union_size.
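
Roughly, the node structure I'm picturing looks like this (a minimal
sketch with made-up names, not the patch in JIRA; java.util.BitSet
stands in for an OpenBitSet/DocSet, and the lowest-level buckets are
assumed to already carry the max df and union of their own terms):

  import java.util.BitSet;

  class FacetTreeNode {
    FacetTreeNode[] children;  // null for a lowest-level bucket of terms
    int maxDf;                 // max df of any term under this node
    BitSet union;              // union of the doc sets of all terms below

    // Bottom-up pass after the tree is built: a parent takes the max
    // df of its children and the union of their doc sets.
    void propagate() {
      if (children == null) return;  // bucket: maxDf/union already set
      maxDf = 0;
      union = new BitSet();
      for (FacetTreeNode child : children) {
        child.propagate();
        maxDf = Math.max(maxDf, child.maxDf);
        union.or(child.union);
      }
    }
  }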

> : pruning by df is going to be one of the most important features
> : here... so a variation (or optimization) would be to keep a list of
> : the highest terms by df, and then build the facet tree excluding those
> : top terms.  That should lower the dfs in the tree nodes and allow more
> : pruning.

> i'm not sure why excluding high DF terms helps your approach, but one of
> the optimizations i anticipated when we first added term faceting was to
> build a cache of the high DF terms and test them first -- if you only want
> the 20 facet terms with the highest counts, and after computing counts for
> the 100 highest DF terms you find your lowest count (#20) to be 678, and
> the DF of the 100th highest DF term in your cache was 677, then you are
> guaranteed you don't need to check any other terms (which by definition
> have lower DFs).
>
> ...so if extracting the high DF terms helps whatever complex tree walking
> you are thinking of, then checking those high DF terms first might save
> you the hassle of walking the tree at all.

Yes, that's a required part of it.  If you exclude the high df terms
from the tree, along with the "bits" they contribute to the unions, then
it becomes mandatory to calculate the intersections for those high df
terms separately.  It will also hopefully act as a good bootstrap to
raise the min_df of the queue and allow you to prune earlier in the tree
based on the node's max_df or intersectionSize(union, baseDocSet).
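
Something like this is what I'm picturing for that first pass (just a
sketch, not the code in JIRA; TermDf and the count function are made up
names here -- count would return intersectionSize(docs(t), baseDocSet)):

  import java.util.PriorityQueue;
  import java.util.function.ToIntFunction;

  // Hypothetical holder for a term plus its df.
  class TermDf { String term; int df; }

  // Terms must be sorted by descending df.  Java's PriorityQueue is a
  // min-heap, so queue.peek() is the smallest of the current top counts.
  static int[] topCounts(TermDf[] byDescDf, int n,
                         ToIntFunction<TermDf> count) {
    PriorityQueue<Integer> queue = new PriorityQueue<>();
    for (TermDf t : byDescDf) {
      // a count can never exceed the df, so once the next df can't beat
      // the current #n count, no remaining (lower-df) term can either
      if (queue.size() == n && t.df <= queue.peek()) break;
      int c = count.applyAsInt(t);
      if (queue.size() < n) queue.add(c);
      else if (c > queue.peek()) { queue.poll(); queue.add(c); }
    }
    int[] top = new int[queue.size()];  // counts only, to keep it short;
    for (int i = top.length - 1; i >= 0; i--) top[i] = queue.poll();
    return top;                         // real code would track terms too
  }

Every count that survives in the queue also raises the bound the tree
walk gets to start from.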

The tree-building code I put in the JIRA bug already extracts the top
terms and excludes them from the tree.

As you note, some of the effectiveness will depend on the distribution
of the terms.
If the df distribution is flat, you don't get to prune much, but
hopefully at least at the bottom of the tree we can avoid checking every
bucket of terms individually by first checking the intersection count of
the union.
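
Concretely, the pruning test at each node might look like this (again
only a sketch, reusing the hypothetical FacetTreeNode from above;
countLeafTerms is made up and would compute the exact per-term counts,
feeding the same queue as the high-df pass):

  import java.util.BitSet;
  import java.util.PriorityQueue;

  void walk(FacetTreeNode node, BitSet baseDocSet,
            PriorityQueue<Integer> queue, int n) {
    if (queue.size() == n) {
      int minCount = queue.peek();
      if (node.maxDf <= minCount) return;          // prune on max_df
      BitSet inter = (BitSet) node.union.clone();
      inter.and(baseDocSet);
      if (inter.cardinality() <= minCount) return; // prune on the union
    }
    if (node.children == null) {
      countLeafTerms(node, baseDocSet, queue, n);  // hypothetical
    } else {
      for (FacetTreeNode child : node.children) {
        walk(child, baseDocSet, queue, n);
      }
    }
  }

That's the fallback even when the df distribution is flat: if a bucket's
union intersects the base docs in, say, 5 documents while the queue
minimum is already 12, the whole bucket goes away after a single
intersection.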

I really don't know how well it will work/scale, but I do think it
will do better than what we have now for high-cardinality facets, so I
intend to find out.

-Yonik
