Thanks so much for your reply Hoss, I didn't realize how much more complicated this gets with distributed search. Do you think it's worth opening a JIRA issue for this? Is there already some ongoing work on the faceting code that this might fit in with?
In the meantime, I think I'll go ahead and do some performance tests on my kludge. That might work for us as an interim measure until I have time to dive into the Solr/Lucene distributed faceting code.

Tom

-----Original Message-----
From: Chris Hostetter [mailto:hossman_luc...@fucit.org]
Sent: Friday, September 30, 2011 9:20 PM
To: solr-user@lucene.apache.org
Subject: RE: Getting facet counts for 10,000 most relevant hits

: I figured out how to do this in a kludgey way on the client side but it
: seems this could be implemented much more efficiently at the Solr/Lucene
: level. I described my kludge and posted a question about this to the

It can, and I have -- but only for the case of a single node...

In general, the faceting code in Solr just needs a DocSet. The default implementation uses the DocSet computed as a side effect of executing the main search, but a custom SearchComponent could pick any DocSet it wants.

A few years back I wrote a custom faceting plugin that computed a "score" for each constraint based on:

* editorially assigned weights from a config file
* the number of matching documents (i.e. the normal constraint count)
* the number of matching documents from the first N results

...where the last number was determined by internally executing the search with "rows" of N to generate a DocList object, then converting that DocList into a DocSet and using that as the input to SimpleFacetCounts.

Ignoring the "editorial weights" part of the above, the logic for "scoring" constraints based on the other two factors is general enough that it could be implemented in Solr; we just need a way to configure "N" and what kind of function should be applied to the two counts.

...But... this approach really breaks down in a distributed model.
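[Editor's note: before the distributed caveats below, the single-node idea of scoring each constraint from its full-index count plus its count within the top-N results can be sketched outside of Solr. This is a minimal, self-contained simulation; the `score` weighting and all names are hypothetical, not anything Solr ships.]

```java
import java.util.*;

public class ConstraintScorer {
    // Hypothetical blend: rank constraints primarily by how often they
    // appear in the top-N results, using the full-index count only as a
    // tie-breaker. The 1000x weight is an illustrative assumption -- Hoss's
    // point is exactly that the function applied to the two counts should
    // be configurable.
    static long score(long totalCount, long topNCount) {
        return 1000 * topNCount + totalCount;
    }

    public static void main(String[] args) {
        // Simulated per-constraint counts: {full-index count, count within top N}
        Map<String, long[]> counts = Map.of(
                "red",   new long[]{5000, 12},   // common overall, rare in the top N
                "blue",  new long[]{300, 180},   // rare overall, dominant in the top N
                "green", new long[]{100, 5});

        List<String> ranked = new ArrayList<>(counts.keySet());
        ranked.sort(Comparator.comparingLong(
                (String c) -> -score(counts.get(c)[0], counts.get(c)[1])));
        System.out.println(ranked);  // prints "[blue, red, green]"
    }
}
```

Note how "blue" outranks "red" despite a far smaller total count, because it dominates the most relevant results -- the behavior the plugin above was after.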
You can't do the same quick and easy DocList->DocSet transformation on each node; you have to do more complicated federating logic like the existing FacetComponent code does, and even there we don't have anything that would help with the "only the first N" type logic.

My best idea would be to do the same thing you describe in your "kludge" approach to solving this in the client...

: (http://lucene.472066.n3.nabble.com/Solr-should-provide-an-option-to-show-only-most-relevant-facet-values-tc3374285.html).

...the coordinator would have to query all of the shards for their top N, and then tell each one exactly which of those docs to include in the "weighted facet constraints" count ... which would make for some really big requests if N is large.

The only sane way to do this type of thing efficiently in a distributed setup would probably be to treat the "top N" part of the goal as a "guideline" for a sampling problem: tell each shard to consider only *their* top N results when computing the top facets in shardReq #1, and then do the same "give me an exact count" type logic in shardReq #2 that we already do. So the constraints picked may not actually be the top constraints for the first N docs across the whole collection (just like right now they aren't guaranteed to be the top constraints for all docs in the collection in a long-tail situation), but they would be representative of the "first-ish" docs across the whole collection.

-Hoss
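[Editor's note: the two-phase sampling idea above -- each shard facets over only *its* top N in request #1, then the coordinator refines exact counts in request #2 -- can be simulated in a few lines. Everything here is a hypothetical sketch of the protocol, not Solr's actual FacetComponent code.]

```java
import java.util.*;
import java.util.stream.*;

public class SampledDistributedFacets {
    // One shard's docs, each tagged with a single facet value; the list is
    // assumed to already be sorted by relevance. Names are illustrative.
    record Shard(List<String> docFacetValues) {
        // Phase 1: facet only over this shard's own top N results (the sample).
        Map<String, Integer> sampleCounts(int n) {
            return docFacetValues.stream().limit(n)
                    .collect(Collectors.toMap(v -> v, v -> 1, Integer::sum));
        }
        // Phase 2: exact count for one candidate constraint over all docs.
        long exactCount(String value) {
            return docFacetValues.stream().filter(value::equals).count();
        }
    }

    public static void main(String[] args) {
        List<Shard> shards = List.of(
            new Shard(List.of("blue", "blue", "red", "blue", "green", "red")),
            new Shard(List.of("red", "blue", "blue", "green", "green", "green")));
        int topN = 2 + 1; // per-shard sample size: a "guideline", not a global top N

        // The coordinator merges the per-shard sampled counts...
        Map<String, Integer> merged = new HashMap<>();
        for (Shard s : shards)
            s.sampleCounts(topN).forEach((v, c) -> merged.merge(v, c, Integer::sum));

        // ...picks candidate constraints from the sample...
        List<String> candidates = merged.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(2).map(Map.Entry::getKey).toList();

        // ...then asks every shard for an exact count of just those candidates.
        for (String c : candidates) {
            long exact = shards.stream().mapToLong(s -> s.exactCount(c)).sum();
            System.out.println(c + "=" + exact);
        }
    }
}
```

With this data the candidates come out as "blue" (exact count 5) and "red" (3), while "green" (true total 4) is never even considered because it missed every shard's top-3 sample -- exactly the "may not actually be the top constraints, but representative of the first-ish docs" trade-off described above.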