RE: Getting facet counts for 10,000 most relevant hits

Burton-West, Tom Mon, 03 Oct 2011 11:05:55 -0700

Thanks so much for your reply Hoss,

I didn't realize how much more complicated this gets with distributed search. 
Do you think it's worth opening a JIRA issue for this?
Is there already some ongoing work on the faceting code that this might fit in 
with?


In the meantime, I think I'll go ahead and do some performance tests on my 
kludge.  That might work for us as an interim measure until I have time to dive 
into the Solr/Lucene distributed faceting code.

Tom

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org] 
Sent: Friday, September 30, 2011 9:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Getting facet counts for 10,000 most relevant hits


: I figured out how to do this in a kludgey way on the client side but it 
: seems this could be implemented much more efficiently at the Solr/Lucene 
: level.  I described my kludge and posted a question about this to the 

It can, and I have -- but only for the case of a single node...

In general the faceting code in Solr just needs a DocSet.  The default 
implementation uses the DocSet computed as a side effect when executing the 
main search, but a custom SearchComponent could pick any DocSet it wants.

A few years back I wrote a custom faceting plugin that computed a "score" 
for each constraint based on:
 * Editorially assigned weights from a config file
 * the number of matching documents (ie: normal constraint count)
 * the number of matching documents from the first N results

...where the last number was determined by internally executing the search 
with "rows" of N, to generate a DocList object, and then converting that 
DocList into a DocSet, and using that as the input to SimpleFacetCounts.
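
To make the single-node idea above concrete, here is a minimal sketch in 
plain Python rather than against the real Solr classes.  The in-memory DOCS 
list and the helper names top_n_doc_ids and facet_counts are made up for 
illustration; an ordered list of ids stands in for a DocList and a set of ids 
stands in for a DocSet.

```python
from collections import Counter

# Hypothetical in-memory "index": (doc_id, relevance_score, facet_value).
DOCS = [
    (1, 9.1, "history"),
    (2, 8.7, "history"),
    (3, 7.5, "science"),
    (4, 3.2, "fiction"),
    (5, 2.9, "fiction"),
    (6, 1.0, "fiction"),
]


def top_n_doc_ids(docs, n):
    """Return the ids of the n highest-scoring docs, best first (DocList stand-in)."""
    ranked = sorted(docs, key=lambda d: d[1], reverse=True)
    return [doc_id for doc_id, _score, _facet in ranked[:n]]


def facet_counts(docs, doc_set):
    """Count facet values over only the docs whose id is in doc_set."""
    return Counter(facet for doc_id, _score, facet in docs if doc_id in doc_set)


doc_list = top_n_doc_ids(DOCS, 3)   # ordered, like a DocList
doc_set = set(doc_list)             # unordered membership, like a DocSet
top_counts = facet_counts(DOCS, doc_set)
all_counts = facet_counts(DOCS, {d[0] for d in DOCS})
```

Here "fiction" dominates all_counts but does not appear in top_counts at all, 
which is the difference the "first N results" count is meant to expose.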

Ignoring the "Editorial weights" part of the above, the logic for 
"scoring" constraints based on the other two factors is general enough 
that it could be implemented in Solr; we just need a way to configure "N" 
and what kind of function should be applied to the two counts.
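
As a rough illustration of "a function applied to the two counts", the 
sketch below ranks constraints with a pluggable combine function.  The name 
score_constraints, the sample counts, and the particular weighting 
(100 * top-N count + total count) are all assumptions chosen for the example, 
not anything that exists in Solr.

```python
def score_constraints(all_counts, top_counts, combine):
    """Rank constraints by combine(total_count, top_n_count), best first.

    Ties are broken alphabetically so the ordering is deterministic.
    """
    scored = [
        (value, combine(total, top_counts.get(value, 0)))
        for value, total in all_counts.items()
    ]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical counts: over all matching docs, and over the first N only.
all_counts = {"fiction": 500, "history": 40, "science": 25}
top_counts = {"history": 6, "science": 3, "fiction": 1}

# Assumed weighting: a hit inside the first N is worth 100 ordinary hits.
ranked = score_constraints(
    all_counts, top_counts, lambda total, top: 100 * top + total
)
```

With this weighting "history" (640) edges out "fiction" (600) even though 
"fiction" has far more matching documents overall; a different combine 
function would shift that balance.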

        ...But...

This approach really breaks down in a distributed model.  You can't do the 
same quick and easy DocList->DocSet transformation on each node, you have 
to do more complicated federating logic like the existing FacetComponent 
code does, and even there we don't have anything that would help with the 
"only the first N" type logic.  My best idea would be to do the same thing 
you describe in your "kludge" approach to solving this in the client...

: (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).

...the coordinator would have to query all of the shards for their top N, 
and then tell each one exactly which of those docs to include in the 
"weighted facets constraints" count ... which would make for some really 
big requests if N is large.
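
A toy simulation of that coordinator-driven approach, to show where the big 
requests come from.  This is plain Python with shards modelled as in-memory 
lists; SHARDS, shard_top_n, shard_count_ids and exact_top_n_facets are 
hypothetical names, and nothing here uses the actual FacetComponent code.

```python
import heapq
from collections import Counter

# Hypothetical shards: each doc is (doc_id, relevance_score, facet_value).
SHARDS = {
    "shard1": [("a1", 9.0, "history"), ("a2", 4.0, "fiction"), ("a3", 1.0, "fiction")],
    "shard2": [("b1", 8.0, "science"), ("b2", 7.0, "history"), ("b3", 2.0, "fiction")],
}


def shard_top_n(docs, n):
    """Request 1, run on a shard: return its n best (score, doc_id) pairs."""
    return heapq.nlargest(n, ((score, doc_id) for doc_id, score, _facet in docs))


def shard_count_ids(docs, ids):
    """Request 2, run on a shard: facet counts over an explicit id list."""
    wanted = set(ids)
    return Counter(facet for doc_id, _score, facet in docs if doc_id in wanted)


def exact_top_n_facets(shards, n):
    """Coordinator: merge per-shard top N, then send each shard its id list."""
    candidates = []
    for name, docs in shards.items():
        for score, doc_id in shard_top_n(docs, n):
            candidates.append((score, name, doc_id))
    global_top = heapq.nlargest(n, candidates)

    # These id lists are the payload of request 2: up to n ids in total.
    ids_by_shard = {}
    for _score, name, doc_id in global_top:
        ids_by_shard.setdefault(name, []).append(doc_id)

    totals = Counter()
    for name, ids in ids_by_shard.items():
        totals.update(shard_count_ids(shards[name], ids))
    return totals, ids_by_shard


totals, ids_by_shard = exact_top_n_facets(SHARDS, 3)
```

The counts are exact for the global top N, but ids_by_shard has to travel 
over the wire, so with N = 10,000 the second round of requests carries up to 
10,000 document ids between them.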

The only sane way to do this type of thing efficiently in a distributed 
setup would probably be to treat the "top N" part of the goal as a 
"guideline" for a sampling problem, telling each shard to consider only 
*their* top N results when computing the top facets in shardReq #1, and 
then do the same "give me an exact count" type logic in shardReq #2 
that we already do.  So the constraints picked may not actually be 
the top constraints for the first N docs across the whole collection (just 
like right now they aren't guaranteed to be the top constraints for all 
docs in the collection in a long tail situation), but they would be 
representative of the "first-ish" docs across the whole collection.
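
A small simulation of that two-request sampling idea, again in plain Python 
with in-memory shards.  The names SHARDS, phase1, phase2 and sampled_facets, 
and the parameter k (how many constraints each shard nominates), are 
assumptions for the sketch; the real shard requests are more involved.

```python
from collections import Counter

# Hypothetical shards: each doc is (relevance_score, facet_value).
SHARDS = {
    "shard1": [(9.0, "history"), (8.0, "history"), (7.0, "science"), (1.0, "fiction")],
    "shard2": [(9.5, "science"), (6.0, "fiction"), (5.0, "fiction"), (0.5, "history")],
}


def local_top_n_counts(docs, n):
    """Facet counts over this shard's own n best-scoring docs."""
    top = sorted(docs, key=lambda d: d[0], reverse=True)[:n]
    return Counter(facet for _score, facet in top)


def phase1(docs, n, k):
    """Request 1: the shard nominates its k top constraints within its top n."""
    return [value for value, _count in local_top_n_counts(docs, n).most_common(k)]


def phase2(docs, n, candidates):
    """Request 2: exact counts, within the shard's top n, for every candidate."""
    counts = local_top_n_counts(docs, n)
    return {value: counts.get(value, 0) for value in candidates}


def sampled_facets(shards, n, k):
    """Coordinator: union the nominations, then sum the refined counts."""
    candidates = set()
    for docs in shards.values():
        candidates.update(phase1(docs, n, k))
    totals = Counter()
    for docs in shards.values():
        totals.update(phase2(docs, n, candidates))
    return totals


totals = sampled_facets(SHARDS, n=3, k=1)
```

In this data "science" also occurs twice among the sampled docs, but no shard 
nominates it, so it never reaches the second request.  That is the 
"representative, not guaranteed" trade-off: no document ids are shipped 
around, at the cost of possibly missing a constraint.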

-Hoss
