And to add some fuel to this fire, I'm seeing in the data I'm processing (the first 100k of UVa MARC records) that the facets are sparsely populated with documents. A lot of documents simply don't have a subject genre on them, for example... almost 50%. Maybe the data will get cleaner as I load more. I figured the sparseness of facet data was worth mentioning, since it's likely typical.

        Erik


On Feb 8, 2007, at 2:54 PM, Yonik Seeley wrote:

A little more brainstorming on this...
pruning by df is going to be one of the most important features
here... so a variation (or optimization) would be to keep a list of
the highest terms by df, and then build the facet tree excluding those
top terms.  That should lower the dfs in the tree nodes and allow more
pruning.
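
As a rough sketch of that partitioning idea (all class and method names here are illustrative, not existing Solr/Lucene APIs): terms above a df cutoff get counted directly against the base docset, while only the low-df remainder goes into the tree, keeping node-level dfs small.

import java.util.ArrayList;
import java.util.List;

class TermInfo {
    final String term;
    final int df;  // document frequency of the term

    TermInfo(String term, int df) {
        this.term = term;
        this.df = df;
    }
}

class TermPartition {
    final List<TermInfo> topTerms = new ArrayList<>();   // counted individually, outside the tree
    final List<TermInfo> treeTerms = new ArrayList<>();  // low-df terms that go into the facet tree

    // Split terms so the df-heavy outliers never inflate the tree nodes.
    static TermPartition partition(List<TermInfo> allTerms, int dfCutoff) {
        TermPartition p = new TermPartition();
        for (TermInfo t : allTerms) {
            (t.df >= dfCutoff ? p.topTerms : p.treeTerms).add(t);
        }
        return p;
    }
}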

-Yonik


On 2/7/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
Heh...
I'm still thinking/brainstorming about it... it only helps if you can
effectively prune though.
Each node in the tree could also keep the max docfreq of any of its children.
If either the facet count or max doc freq for a node is less than the
minimum facet count in your priority queue, then you don't have to
expand that node.
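
A minimal sketch of that pruning test (FacetNode and its fields are hypothetical, not an existing class):

class FacetNode {
    int facetCount;   // docs from the base set under this node; bounds any child's count
    int maxChildDf;   // largest docfreq among this node's children; also bounds any child's count

    // If either upper bound is already below the smallest count in the
    // top-N priority queue, no descendant can displace a queue entry,
    // so the whole subtree can be skipped.
    boolean canPrune(int minCountInQueue) {
        return facetCount < minCountInQueue || maxChildDf < minCountInQueue;
    }
}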

One other piece of info you can glean: if your branching factor is 10,
and the facet count for a node is 430, then by pigeonhole you are
guaranteed a count of at least 43 at the next level. I'm not sure how
one could use that info though (beyond just picking the highest node
anyway).
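
The guarantee is just ceiling division; a tiny helper to make the arithmetic explicit (hypothetical, of course):

// With branching factor b, a node's count c is split among at most b
// children, so some child has a count of at least ceil(c / b):
// here, ceil(430 / 10) = 43.
static int guaranteedChildCount(int nodeCount, int branchingFactor) {
    return (nodeCount + branchingFactor - 1) / branchingFactor;  // integer ceiling division
}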

The facet counts at nodes could be used to direct where you expand
first, but can't really be used to prune like the docfreq can.

Pruning won't work well when the base docset size is small... but if
it's small enough, we can use other strategies for that.  It also
won't work well for fields where higher docfreqs are rare... author
*might* be a case of that.

If we are making our own hierarchy, we can order it to be more effective.
For instance, the node containing terms 1-10 doesn't have to be a
sibling of the node with terms 11-20... the hard part is figuring out
how we want to arrange them.

At the leaves... perhaps putting nodes with high maxDf together???
That might make it more likely you could prune off other nodes
quicker.  It seems like one would want to minimize set overlap (i.e.
minimize the total cardinality of all the sets at any one level).
But that doesn't seem easy to do in a timely manner.

So, off the top of my head right now, I guess maybe just sort the
leaves by maxDf and build the rest of the tree on top of that.
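
As a rough sketch of that ordering, assuming we're free to invent the node layout (none of these names exist in Solr):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DfNode {
    int maxDf;  // max docfreq of any term below this node
    final List<DfNode> children = new ArrayList<>();
}

class DfTreeBuilder {
    // Group each run of `branching` nodes under one parent, propagating
    // maxDf upward so it can drive pruning later.
    static List<DfNode> buildLevel(List<DfNode> level, int branching) {
        List<DfNode> parents = new ArrayList<>();
        for (int i = 0; i < level.size(); i += branching) {
            DfNode parent = new DfNode();
            for (DfNode child : level.subList(i, Math.min(i + branching, level.size()))) {
                parent.children.add(child);
                parent.maxDf = Math.max(parent.maxDf, child.maxDf);
            }
            parents.add(parent);
        }
        return parents;
    }

    static DfNode build(List<DfNode> leaves, int branching) {
        // Sorting leaves by descending maxDf clusters the high-df terms
        // together, leaving the remaining subtrees with low maxDf bounds
        // that can be pruned as a group.
        List<DfNode> level = new ArrayList<>(leaves);
        level.sort(Comparator.comparingInt((DfNode n) -> n.maxDf).reversed());
        while (level.size() > 1) {
            level = buildLevel(level, branching);
        }
        return level.isEmpty() ? new DfNode() : level.get(0);
    }
}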

Thinking all this stuff up from scratch seems like the hard way...
Does anyone know how other people have implemented this stuff?

-Yonik
