adfel70 <adfe...@gmail.com> wrote:
> I am trying to understand why faceting on a field with lots of unique values
> has a great impact on query performance.

Faceting in Solr is performed in different ways. String faceting different from 
Numerics faceting, DocValued fields different from non-DocValued, fc different 
from enum. Let's look at String faceting with facet.method=fc and DocValues.

Strings (aka Terms) are represented in the faceting code with an ordinal, which 
is really just a number. The first term has number 0, the next number 1 and so 
forth. When doing a faceting call with the above premise, what happens is

1) An counter of int[unique_values] is allocated.
This is fairly fast, but still with a noticeable impact when the number of 
unique value creeps into the millions. On our machine it takes several hundred 
milliseconds for 100M values. Also relevant is the overall strain it puts on 
the garbage collector.

2) For each hit in the result set, the corresponding ordinals are resolved and 
counter[ordinal]++ is triggered.
This scales with the result set. Small sets are very fast, quite independent of 
the size of the counter-structure. Large result sets are (naturally) equally 
slow.

3) The counter-structure is iterated and top-X are determined.
This scales with the size of the counter-structure, (nearly) independent of the 
result set size.

4) The Terms for the top-X ordinals are resolved from the index.
This scales with X.


Some of these parts has some non-intuitive penalties: Even very tiny result 
sets has aa constant overhead from allocation and iteration. Asking for top-1M 
hits means that the underlying priority queue will probably no longer fit in 
the CPU cache and will slow things down. Resolving Terms from ordinals relies 
of fast IO and a large number of unique Terms might mean that the disk cache is 
not large enough.


Blatant plug: I have spend a fair amount of time trying to make some of this 
faster http://tokee.github.io/lucene-solr/

- Toke Eskildsen

Reply via email to