John Davis <johndavis925...@gmail.com> wrote:
> 100M unique values might be across all docs, and unless the faceting
> implementation is really naive I cannot see how that can come into play
> when the query matches a fraction of those.

Solr's simple string faceting uses an int array to hold the counts for the terms in the facet. The array has one entry per unique term, which means 100M entries in your case (divided among the shards). To extract the top-n, all of those entries are iterated. Nearly all of them are 0 when the query result is small, but the implementation still has to visit every entry. With low to medium cardinality (say, up to a few million) this is normally not noticeable, but as cardinality goes up it takes its toll.

Number-based faceting uses a hashmap instead, but that approach scales poorly when the result set gets large (millions).

I had great results with Solr 4.10 using a structure that tracks the updated counters, and I am in the process of porting it to Solr 7. No promises about when that will finish, and especially not about when/if it will be production-quality. A detailed description of the tracking idea is at
https://sbdevel.wordpress.com/2014/03/17/fast-faceting-with-high-cardinality-and-small-result-set/

- Toke Eskildsen
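
To make the difference concrete, here is a minimal sketch of the two extraction strategies described above: scanning the whole counter array versus visiting only the counters that were actually updated. It is an illustration under stated assumptions, not Solr code and not the implementation from the linked post; the class TrackedFacetCounter and all of its method names are hypothetical, and the tracker is simply sized for the worst case to keep the example short.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical illustration only; none of these names come from Solr. */
final class TrackedFacetCounter {
    // Heap order: lowest count first, ties broken by highest ordinal first,
    // so the head of the heap is always the weakest top-n candidate.
    private static final Comparator<int[]> WORST_FIRST = (a, b) ->
            a[1] != b[1] ? Integer.compare(a[1], b[1]) : Integer.compare(b[0], a[0]);

    private final int[] counts;   // one counter per unique term ordinal
    private final int[] touched;  // ordinals whose counter left zero
    private int touchedSize = 0;

    TrackedFacetCounter(int uniqueTerms) {
        counts = new int[uniqueTerms];
        touched = new int[uniqueTerms]; // assumption: sized for the worst case
    }

    /** Called once per (matching document, term) pair. */
    void increment(int ordinal) {
        if (counts[ordinal]++ == 0) {
            touched[touchedSize++] = ordinal; // remember first update only
        }
    }

    int touchedCount() {
        return touchedSize;
    }

    /** Plain approach: visits every counter, even the ones that are 0. */
    List<int[]> topNFullScan(int n) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(WORST_FIRST);
        for (int ordinal = 0; ordinal < counts.length; ordinal++) {
            offer(heap, n, ordinal, counts[ordinal]);
        }
        return drain(heap);
    }

    /** Tracked approach: visits only the counters that were updated. */
    List<int[]> topNTracked(int n) {
        PriorityQueue<int[]> heap = new PriorityQueue<>(WORST_FIRST);
        for (int i = 0; i < touchedSize; i++) {
            int ordinal = touched[i];
            offer(heap, n, ordinal, counts[ordinal]);
        }
        return drain(heap);
    }

    private static void offer(PriorityQueue<int[]> heap, int n, int ordinal, int count) {
        if (count == 0) {
            return; // terms without hits never enter the result
        }
        heap.add(new int[] {ordinal, count});
        if (heap.size() > n) {
            heap.poll(); // drop the weakest candidate
        }
    }

    /** Returns {ordinal, count} pairs, highest count first. */
    private static List<int[]> drain(PriorityQueue<int[]> heap) {
        List<int[]> result = new ArrayList<>();
        while (!heap.isEmpty()) {
            result.add(heap.poll());
        }
        Collections.reverse(result);
        return result;
    }
}
```

Both methods return the same top-n. The loop in topNFullScan runs once per unique term, while the loop in topNTracked runs once per term that the query actually hit, which is where the saving comes from when the result set is small and the cardinality is high.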