On Mon, 2020-03-09 at 10:39 +0100, Nicolas Paris wrote:
> I want to provide terms facet on a string multivalue field.
> ...
> How to improve brute performances ?

It might help to have everything in a single shard, to avoid the
secondary fine count. But your index is rather large for a single
shard, so that might have a negative impact on overall speed.

JSON faceting allows you to skip the fine counting with the parameter
refine:
https://lucene.apache.org/solr/guide/8_4/json-facet-api.html#terms-facet
Should be easy to try.

> I am wondering how I could filter the documents to get approximate
> facets ?

Clunky idea: Introduce a hash field for each document. When you need
the heavy facet call, start with a search just to count the number of
documents. If it is too high, add a prefix filter on the hash field
with a random hex value.

  1M hits:   q=foo               -> Facets for 1M documents
  10M hits:  q=foo&fq=hash:1*    -> Facets for ~625K documents (10M/16)
  100M hits: q=foo&fq=hash:ab*   -> Facets for ~390K documents (100M/256)

If you want it more fine-grained, to hit closer to your ~2M limit, you
could add a bit more filter logic:

  100M hits: q=foo&fq=hash:00* OR hash:01* OR hash:02* OR hash:03* OR hash:04*
             -> Facets for ~1950K documents (100M/256 * 5)

Prefix queries might prove to be too expensive, so you could also
create fields with random values from 0-9, 0-99, 0-999 etc. and do
exact match filtering on those to get the number of hits down.

- Toke Eskildsen,
  Royal Danish Library
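
For illustration, the prefix selection described above could be
scripted roughly as below. This is only a sketch under assumptions:
the field is called hash as in the examples, it holds a lowercase hex
string that is evenly distributed across documents, and the hit count
comes from a prior counting query. The function name sampling_filter
and its target default of 2M are made up for the example.

```python
import random


def sampling_filter(num_hits, target=2_000_000, rng=random):
    """Build an fq value that cuts num_hits down to roughly target docs.

    Hypothetical helper: assumes a field named "hash" holding an evenly
    distributed lowercase hex string. Returns None when no sampling is needed.
    """
    if num_hits <= target:
        return None

    # Shortest hex prefix for which a single bucket fits under the target.
    length = 1
    while num_hits / 16 ** length > target:
        length += 1

    buckets = 16 ** length
    per_bucket = num_hits / buckets

    # Take as many random buckets as fit under the target (at least one).
    count = max(1, int(target // per_bucket))
    chosen = rng.sample(range(buckets), count)

    # Zero-pad each bucket number to the prefix length, e.g. 4 -> "04".
    prefixes = sorted(format(b, "0{}x".format(length)) for b in chosen)
    return " OR ".join("hash:{}*".format(p) for p in prefixes)


if __name__ == "__main__":
    for hits in (1_000_000, 10_000_000, 100_000_000):
        print(hits, "->", sampling_filter(hits))
```

With 100M hits and the 2M target this picks five two-character
prefixes, matching the 100M/256 * 5 example. The prefixes are chosen
at random rather than fixed at 00..04, which is an arbitrary choice
for the sketch.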