adfel70 <adfe...@gmail.com> wrote: > I am trying to understand why faceting on a field with lots of unique values > has a great impact on query performance.
Faceting in Solr is performed in different ways. String faceting different from Numerics faceting, DocValued fields different from non-DocValued, fc different from enum. Let's look at String faceting with facet.method=fc and DocValues. Strings (aka Terms) are represented in the faceting code with an ordinal, which is really just a number. The first term has number 0, the next number 1 and so forth. When doing a faceting call with the above premise, what happens is 1) An counter of int[unique_values] is allocated. This is fairly fast, but still with a noticeable impact when the number of unique value creeps into the millions. On our machine it takes several hundred milliseconds for 100M values. Also relevant is the overall strain it puts on the garbage collector. 2) For each hit in the result set, the corresponding ordinals are resolved and counter[ordinal]++ is triggered. This scales with the result set. Small sets are very fast, quite independent of the size of the counter-structure. Large result sets are (naturally) equally slow. 3) The counter-structure is iterated and top-X are determined. This scales with the size of the counter-structure, (nearly) independent of the result set size. 4) The Terms for the top-X ordinals are resolved from the index. This scales with X. Some of these parts has some non-intuitive penalties: Even very tiny result sets has aa constant overhead from allocation and iteration. Asking for top-1M hits means that the underlying priority queue will probably no longer fit in the CPU cache and will slow things down. Resolving Terms from ordinals relies of fast IO and a large number of unique Terms might mean that the disk cache is not large enough. Blatant plug: I have spend a fair amount of time trying to make some of this faster http://tokee.github.io/lucene-solr/ - Toke Eskildsen