A little more brainstorming on this...
pruning by df is going to be one of the most important features
here... so a variation (or optimization) would be to keep a list of
the highest terms by df, and then build the facet tree excluding those
top terms.  That should lower the dfs in the tree nodes and allow more
pruning.

-Yonik


On 2/7/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
Heh...
I'm still thinking/brainstorming about it... it only helps if you can
effectively prune though.
Each node in the tree could also keep the max docfreq of any of it's children.
If either the facet count or max doc freq for a node is less than the
minimum facet count in your priority queue, then you don't have to
expand that node.

One other piece of info you can glean: If your branching factor is 10,
and the facet count for a node is 430, then you know that you are
guaranteed at least a count of 43 at the next level.  I'm not sure how
one could use that info though (beyond just picking the highest node
anyway).

The facet counts at nodes could be used to direct where you expand
first, but can't really be used to prune like the docfreq can.

Pruning won't work well when the base docset size is small... but if
it's small enough, we can use other strategies for that.  It also
won't work well for fields where higher docfreqs are rare... author
*might* be a case of that.

If we are making our own hierarchy, we can order it to be more effective.
For instance, the node containing terms 1-10 doesn't have to be a
sibling of the node with terms 11-20...  the hard part is figuring how
we want to arrange them.

At the leaves... perhaps putting nodes with high maxDf together???
That might make it more likely you could prune off other nodes
quicker.  It seems like one would want to minimize set overlap (i.e.
minimize the total cardinallity of all the sets at any one level).
But that doesn't seem easy to do in a timely manner.

So, off the top of my head right now, I guess maybe just sort the
leaves by maxDf and build the rest of the tree on top of that.

Thinking all this stuff up from scratch seems like the hard way...
Does anyone know how other people have implemented this stuff?

-Yonik

Reply via email to