FWIW, there is work being done for "high cardinality faceting" with some of the recent Streaming Aggregation code.
So it's at least on the way if not already there.

Erick

On Tue, Sep 22, 2015 at 11:44 AM, Toke Eskildsen <t...@statsbiblioteket.dk> wrote:
> adfel70 <adfe...@gmail.com> wrote:
>> Hi Toke, Thank you for the detailed explanation, that's exactly what I was
>> looking for, except this algorithm fits a single index only. Could you please
>> elaborate what adjustments are needed for a distributed index?
>
> Vanilla Solr requests top-X terms from each shard, with over-provisioning. I
> do not remember the exact formula (and I think it is adjustable in Solr 5),
> but something like X*1.5+10? Yes, that means that correctness is not
> guaranteed for distributed faceting. It would be possible to make some sort
> of streaming faceting implementation, but the pathological case is that all
> shards must deliver all terms to derive the correct top-X.
>
> The results from the shards are merged and the top-X terms are fine-counted
> where needed: If we have 3 shards and asked for top-1, they might answer
> shard1: [foo(3), zoo(1)]
> shard2: [foo(1), zoo(1)]
> shard3: [bar(2), aar(2)]
> (remember the over-provisioning). We derive that foo is the top-1 term, but
> since shard 3 did not provide a count for foo, we need to ask shard3 for the
> count for that specific term to get the correct overall count.
>
> The fine-counting is performed differently from standard faceting. It is
> basically 'original_query AND facet_field:fine_count_term'. Quite fast for a
> few terms, but if there is a need for resolving tens or hundreds of terms for
> a non-trivial index, the fine-counting phase can take longer than the initial
> faceting phase.
>
> - Toke Eskildsen
> (sorry for the delayed answer - my email reader hid your response)
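
For anyone following along, the two-phase scheme Toke describes (over-provisioned top-X per shard, merge, then fine-counting of terms a shard did not report) can be sketched roughly as below. This is a toy in-memory simulation in Python, not Solr code: the function names, the `SHARDS` data, and the assumption that shard3 holds one document with `foo` are all made up for illustration, and `overprovisioned_limit` only encodes the "something like X*1.5+10" recollection from the quoted mail.

```python
from collections import Counter


def overprovisioned_limit(x, ratio=1.5, extra=10):
    """Per-shard request size; the formula is the approximate one from the mail."""
    return int(x * ratio) + extra


def rank(counts):
    """Sort (term, count) pairs by descending count, then by term for stable ties."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))


def shard_top_terms(shard_counts, limit):
    """Phase 1 on a single shard: report only its `limit` most frequent terms."""
    return dict(rank(shard_counts)[:limit])


def distributed_top_x(shards, x, shard_limit):
    """Return (top-X terms with counts, number of fine-count lookups issued)."""
    # Phase 1: every shard answers with its own top terms only.
    responses = [shard_top_terms(shard, shard_limit) for shard in shards]

    # Merge the partial counts and pick the candidate top-X terms.
    merged = Counter()
    for response in responses:
        merged.update(response)
    candidates = [term for term, _ in rank(merged)[:x]]

    # Phase 2 (fine-counting): for each candidate, ask every shard that did
    # not report it for its exact count. In Solr this corresponds to a query
    # like 'original_query AND facet_field:term'; here it is a dict lookup.
    lookups = 0
    for term in candidates:
        for shard, response in zip(shards, responses):
            if term not in response:
                merged[term] += shard.get(term, 0)
                lookups += 1

    return rank({term: merged[term] for term in candidates}), lookups


# Hypothetical full term counts per shard. The reported top-2 lists match the
# example in the mail; foo=1 on shard3 is an assumption for illustration.
SHARDS = [
    {"foo": 3, "zoo": 1},
    {"foo": 1, "zoo": 1},
    {"bar": 2, "aar": 2, "foo": 1},
]

top, lookups = distributed_top_x(SHARDS, x=1, shard_limit=2)
print(top, lookups)  # [('foo', 5)] 1
```

Note that fine-counting only corrects the counts of terms that already made it into the merged candidates. A term that sits just below the cut-off on every shard is never seen at all, which is why correctness is not guaranteed without all shards delivering all terms.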