On Mon, 2020-03-09 at 10:39 +0100, Nicolas Paris wrote:
> I want to provide terms facet on a string multivalue field.
> ...
> How to improve brute performances ?

It might help to have everything in a single shard, as that avoids the
secondary fine-counting (refinement) phase that distributed faceting
needs. But your index is rather large for a single shard, so that might
have a negative impact on overall speed.

JSON faceting allows you to skip the fine counting by setting the
parameter refine to false:
https://lucene.apache.org/solr/guide/8_4/json-facet-api.html#terms-facet
Should be easy to try.
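
A minimal sketch of such a request body, in Python. The facet name
"top_terms" and the field name "my_multivalued_field" are made up for
illustration; only "type", "field", "limit" and "refine" are taken from
the terms facet documentation linked above.

```python
import json


def terms_facet_request(query, field, limit=10, refine=False):
    """Build a JSON request body with a single terms facet."""
    return {
        "query": query,
        "limit": 0,  # no documents returned, only facet counts
        "facet": {
            "top_terms": {  # hypothetical facet name
                "type": "terms",
                "field": field,
                "limit": limit,
                "refine": refine,  # False skips the fine-counting phase
            }
        },
    }


# "my_multivalued_field" is a placeholder for the real field name.
body = terms_facet_request("foo", "my_multivalued_field")
print(json.dumps(body, indent=2))
```

The printed JSON would be sent as the body of a POST to the collection's
query handler. Without refinement the counts from a multi-shard setup
can be slightly off, which is the trade-off for the saved phase.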

> I am wondering how I could filter the documents to get approximate
> facets ?

Clunky idea: Introduce a hash field for each document. When you need
the heavy facet call, start with a search just to count the number of
documents. If it is too high, add a prefix-filter for the hash-field
with a random hex value.

1M hits:
q=foo
-> Facets for 1M documents

10M hits:
q=foo&fq=hash:1*
-> Facets for 625K documents (10M/16)

100M hits:
q=foo&fq=hash:ab*
-> Facets for 390K documents (100M/256)

If you want it more fine-grained to hit closer to your ~2M limit, you
could add a bit more filter logic:

100M hits:
q=foo&fq=hash:00* OR hash:01* OR hash:02* OR hash:03* OR hash:04*
-> Facets for 1950K documents (100M/256 * 5)
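
The filter logic above could be sketched roughly like this in Python.
It assumes that the hash field holds evenly distributed lowercase hex
strings and that the hit count comes from a cheap count-only search.
The function name and the 2M default are made up for illustration.

```python
import random


def hash_prefix_filter(hits, max_docs=2_000_000, rng=random):
    """Return an fq value that keeps about max_docs hits, or None."""
    if hits <= max_docs:
        return None  # small enough, facet on everything

    # Shortest hex prefix for which a single bucket fits the limit.
    length = 1
    while hits / 16 ** length > max_docs:
        length += 1

    buckets = 16 ** length
    per_bucket = hits / buckets
    # Number of buckets that fit under the limit, at least one.
    count = max(1, int(max_docs // per_bucket))

    # Consecutive buckets from a random starting point, wrapping around.
    start = rng.randrange(buckets)
    spec = "0{}x".format(length)
    prefixes = [format((start + i) % buckets, spec) for i in range(count)]
    return " OR ".join("hash:{}*".format(p) for p in prefixes)


for hits in (1_000_000, 10_000_000, 100_000_000):
    print(hits, "->", hash_prefix_filter(hits))
```

For 100M hits this picks 5 two-character prefixes, as in the example
above. The fq value must be URL-encoded before it is sent, and the
facet counts are for the sample only, so they need to be scaled by
buckets/count if estimates for the full result set are wanted.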

Prefix queries might prove to be too expensive, so you could also
create fields with random values from 0-9, 0-99, 0-999 etc. and do
exact match filtering on those to get the number of hits down.
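
A rough Python sketch of that variant. The field names bucket_10,
bucket_100 and bucket_1000 are hypothetical, as is the 2M default. The
first function would run once per document at index time, the second at
query time after the count-only search.

```python
import random

BUCKET_SIZES = (10, 100, 1000)


def bucket_fields(rng=random):
    """Index time: random bucket values to store with one document."""
    return {"bucket_{}".format(n): rng.randrange(n) for n in BUCKET_SIZES}


def bucket_filter(hits, max_docs=2_000_000, rng=random):
    """Query time: exact-match fq that keeps about max_docs hits."""
    if hits <= max_docs:
        return None  # small enough, no sampling needed
    for n in BUCKET_SIZES:
        if hits / n <= max_docs:
            return "bucket_{}:{}".format(n, rng.randrange(n))
    # Still too many hits: fall back to the finest bucket field.
    n = BUCKET_SIZES[-1]
    return "bucket_{}:{}".format(n, rng.randrange(n))


print(bucket_fields())
print(bucket_filter(100_000_000))
```

With 100M hits this filters on one value of bucket_100, leaving about
1M documents for the facet call.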


- Toke Eskildsen, Royal Danish Library

