Re: Facets based on sampling

Toke Eskildsen Tue, 24 Oct 2017 22:59:23 -0700

John Davis <johndavis925...@gmail.com> wrote:
> 100M unique values might be across all docs, and unless the faceting
> implementation is really naive I cannot see how that can come into play
> when the query matches a fraction of those.


Solr simple string faceting uses an int-array to hold counts for the different 
terms in the facet. This array has the same length as the number of unique 
terms, which means 100M in your case (divided among the shards). In order to 
extract the top-n, all 100M entries are iterated. Nearly all of them are 0 when
the query result is small, but the implementation still has to visit every
array entry. With low to medium cardinality (let's say up to a few
million) this is normally not noticeable, but as cardinality goes up it takes 
its toll.
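
To make the cost concrete, here is a minimal sketch of the dense approach
described above. It is not Solr's actual code: the class and method names are
made up for illustration, and it assumes a single-valued field where each
matching document contributes one term ordinal.

```java
import java.util.PriorityQueue;

// Hypothetical illustration of dense int-array faceting, not Solr's implementation.
class DenseFacetSketch {

    // One counter per unique term; docOrds holds the term ordinal of each matching document.
    static int[] count(int[] docOrds, int uniqueTerms) {
        int[] counts = new int[uniqueTerms];
        for (int ord : docOrds) {
            counts[ord]++;
        }
        return counts;
    }

    // Returns the ordinals of the n highest non-zero counts, highest count first.
    static int[] topN(int[] counts, int n) {
        // Min-heap on count, so the weakest candidate is the one evicted.
        PriorityQueue<Integer> heap =
                new PriorityQueue<>((a, b) -> Integer.compare(counts[a], counts[b]));
        // This loop visits every counter, even when almost all of them are 0.
        for (int ord = 0; ord < counts.length; ord++) {
            if (counts[ord] == 0) {
                continue;
            }
            heap.add(ord);
            if (heap.size() > n) {
                heap.poll();
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = heap.poll();
        }
        return result;
    }
}
```

The counting step is proportional to the number of hits, but topN is
proportional to the number of unique terms no matter how few hits there are.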

Number-based faceting uses a hashmap instead, but this approach scales poorly
when the result set gets large (millions). I had great results with a structure
for Solr 4.10 that tracked the updated counters, and I am in the process of
porting it to Solr 7. No promises about when that will finish, and especially
not about when/if it will be production-quality. A detailed description of the
tracking idea is at
https://sbdevel.wordpress.com/2014/03/17/fast-faceting-with-high-cardinality-and-small-result-set/
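
A rough sketch of what tracking the updated counters could look like follows.
This is only one reading of the idea, not the code from the linked post or
from any Solr patch, and all names in it are hypothetical.

```java
import java.util.Arrays;

// Hypothetical illustration of counter tracking, not the implementation from the linked post.
class TrackedCounterSketch {
    private final int[] counts;   // one counter per unique term, as in the dense approach
    private final int[] touched;  // ordinals whose counter went from 0 to non-zero
    private int touchedSize = 0;

    TrackedCounterSketch(int uniqueTerms) {
        counts = new int[uniqueTerms];
        // Sized for the worst case to keep the sketch simple; this doubles the memory use.
        touched = new int[uniqueTerms];
    }

    void increment(int ord) {
        // Record the ordinal only the first time its counter is updated.
        if (counts[ord]++ == 0) {
            touched[touchedSize++] = ord;
        }
    }

    int count(int ord) {
        return counts[ord];
    }

    // Only these ordinals need to be inspected when extracting the top-n.
    int[] touchedOrdinals() {
        return Arrays.copyOf(touched, touchedSize);
    }

    // Resets only the counters that were updated, so the structure can be reused cheaply.
    void clear() {
        for (int i = 0; i < touchedSize; i++) {
            counts[touched[i]] = 0;
        }
        touchedSize = 0;
    }
}
```

With this, both top-n extraction and clearing are proportional to the number
of distinct terms in the result set rather than to the total number of unique
terms.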

- Toke Eskildsen
