And to add some fuel to this fire, I'm seeing in the data I'm processing (the first 100k of UVa MARC records) that the facets are sparsely populated with documents. A lot of documents simply don't have a subject genre on them, for example... almost 50%. Maybe the data will get cleaner as I load more. I figured the sparseness of facet data was worth mentioning, since it's likely typical.

        Erik


On Feb 8, 2007, at 2:54 PM, Yonik Seeley wrote:

A little more brainstorming on this...
pruning by df is going to be one of the most important features
here... so a variation (or optimization) would be to keep a list of
the highest terms by df, and then build the facet tree excluding those
top terms.  That should lower the dfs in the tree nodes and allow more
pruning.
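
As a rough sketch of that partitioning idea (all class and method names here are illustrative, not existing Solr/Lucene APIs): terms above a df cutoff get counted directly against the base docset, while only the low-df remainder goes into the tree, keeping node-level dfs small.

import java.util.ArrayList;
import java.util.List;

class TermInfo {
    final String term;
    final int df;  // document frequency of the term

    TermInfo(String term, int df) {
        this.term = term;
        this.df = df;
    }
}

class TermPartition {
    final List<TermInfo> topTerms = new ArrayList<>();   // counted individually, outside the tree
    final List<TermInfo> treeTerms = new ArrayList<>();  // low-df terms that go into the facet tree

    // Split terms so the df-heavy outliers never inflate the tree nodes.
    static TermPartition partition(List<TermInfo> allTerms, int dfCutoff) {
        TermPartition p = new TermPartition();
        for (TermInfo t : allTerms) {
            (t.df >= dfCutoff ? p.topTerms : p.treeTerms).add(t);
        }
        return p;
    }
}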

-Yonik


On 2/7/07, Yonik Seeley <[EMAIL PROTECTED]> wrote:
Heh...
I'm still thinking/brainstorming about it... it only helps if you can
effectively prune though.
Each node in the tree could also keep the max docfreq of any of its children.
If either the facet count or max doc freq for a node is less than the
minimum facet count in your priority queue, then you don't have to
expand that node.
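
A minimal sketch of that pruning test (FacetNode and its fields are hypothetical, not an existing class):

class FacetNode {
    int facetCount;   // docs from the base set under this node; bounds any child's count
    int maxChildDf;   // largest docfreq among this node's children; also bounds any child's count

    // If either upper bound is already below the smallest count in the
    // top-N priority queue, no descendant can displace a queue entry,
    // so the whole subtree can be skipped.
    boolean canPrune(int minCountInQueue) {
        return facetCount < minCountInQueue || maxChildDf < minCountInQueue;
    }
}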

One other piece of info you can glean: if your branching factor is 10,
and the facet count for a node is 430, then by pigeonhole you are
guaranteed a count of at least 43 at the next level. I'm not sure how
one could use that info though (beyond just picking the highest node
anyway).
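
The guarantee is just ceiling division; a tiny helper to make the arithmetic explicit (hypothetical, of course):

// With branching factor b, a node's count c is split among at most b
// children, so some child has a count of at least ceil(c / b):
// here, ceil(430 / 10) = 43.
static int guaranteedChildCount(int nodeCount, int branchingFactor) {
    return (nodeCount + branchingFactor - 1) / branchingFactor;  // integer ceiling division
}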

The facet counts at nodes could be used to direct where you expand
first, but can't really be used to prune like the docfreq can.

Pruning won't work well when the base docset size is small... but if
it's small enough, we can use other strategies for that.  It also
won't work well for fields where higher docfreqs are rare... author
*might* be a case of that.

If we are making our own hierarchy, we can order it to be more effective.
For instance, the node containing terms 1-10 doesn't have to be a
sibling of the node with terms 11-20... the hard part is figuring out
how we want to arrange them.

At the leaves... perhaps putting nodes with high maxDf together???
That might make it more likely you could prune off other nodes
quicker.  It seems like one would want to minimize set overlap (i.e.
minimize the total cardinality of all the sets at any one level).
But that doesn't seem easy to do in a timely manner.

So, off the top of my head right now, I guess maybe just sort the
leaves by maxDf and build the rest of the tree on top of that.
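
As a rough sketch of that ordering, assuming we're free to invent the node layout (none of these names exist in Solr):

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class DfNode {
    int maxDf;  // max docfreq of any term below this node
    final List<DfNode> children = new ArrayList<>();
}

class DfTreeBuilder {
    // Group each run of `branching` nodes under one parent, propagating
    // maxDf upward so it can drive pruning later.
    static List<DfNode> buildLevel(List<DfNode> level, int branching) {
        List<DfNode> parents = new ArrayList<>();
        for (int i = 0; i < level.size(); i += branching) {
            DfNode parent = new DfNode();
            for (DfNode child : level.subList(i, Math.min(i + branching, level.size()))) {
                parent.children.add(child);
                parent.maxDf = Math.max(parent.maxDf, child.maxDf);
            }
            parents.add(parent);
        }
        return parents;
    }

    static DfNode build(List<DfNode> leaves, int branching) {
        // Sorting leaves by descending maxDf clusters the high-df terms
        // together, leaving the remaining subtrees with low maxDf bounds
        // that can be pruned as a group.
        List<DfNode> level = new ArrayList<>(leaves);
        level.sort(Comparator.comparingInt((DfNode n) -> n.maxDf).reversed());
        while (level.size() > 1) {
            level = buildLevel(level, branching);
        }
        return level.isEmpty() ? new DfNode() : level.get(0);
    }
}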

Thinking all this stuff up from scratch seems like the hard way...
Does anyone know how other people have implemented this stuff?

-Yonik
