Daniel Lowe created SOLR-14167:
----------------------------------

             Summary: Exact unique counts when shards contain disjoint values
                 Key: SOLR-14167
                 URL: https://issues.apache.org/jira/browse/SOLR-14167
             Project: Solr
          Issue Type: Improvement
      Security Level: Public (Default Security Level. Issues are Public)
          Components: Facet Module
            Reporter: Daniel Lowe
         Attachments: UniqueSumPerShard.java

Currently, for fields with high cardinality, the facet module offers two 
implementations (unique, hll) that give approximate results. There is one 
corner case where a distributed search against a high-cardinality field should 
still be able to provide an exact result efficiently: when the shards are known 
to contain disjoint values, i.e. there may be duplicates within a shard, but no 
value exists on more than one shard. In that case the exact total is simply the 
sum of the exact per-shard unique counts.
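
To illustrate the merge step, here is a minimal standalone sketch. It is not the attached UniqueSumPerShard.java and uses no Solr APIs; the class and method names are hypothetical, and shards are modelled as plain in-memory lists.

```java
import java.util.HashSet;
import java.util.List;

public class DisjointShardUnique {

    /** Exact distinct count within one shard; duplicates inside the shard collapse. */
    public static long shardUnique(List<String> shardValues) {
        return new HashSet<>(shardValues).size();
    }

    /**
     * Merge step: add up the per-shard distinct counts.
     * The result is exact only if no value appears on more than one shard;
     * otherwise it overcounts.
     */
    public static long mergeDisjoint(List<List<String>> shards) {
        long total = 0;
        for (List<String> shard : shards) {
            total += shardUnique(shard);
        }
        return total;
    }

    public static void main(String[] args) {
        // Three shards with duplicates inside a shard but no value shared between shards.
        List<List<String>> shards = List.of(
                List.of("a", "a", "b"),
                List.of("c", "d", "d", "d"),
                List.of("e"));
        System.out.println(mergeDisjoint(shards)); // prints 5
    }
}
```

Each shard only needs to send a single number to the coordinator, which is what makes this cheap. If the disjointness assumption is violated, e.g. shards {a, b} and {b, c}, the sum gives 4 rather than the true 3.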

That happens to be the case in the collection I have, but this feels to me like 
a very niche use case. Is this functionality too niche for inclusion into the 
Facet module?

I attach a naive (untested) example implementation. It could be made slightly 
more efficient if {{SlotAcc}} implementations that didn't populate the first 
100 values were used (or if this behaviour was made configurable, perhaps via 
the {{FacetContext}}?).

Slightly off topic, but the documentation currently says of unique "Beyond 100 
values it yields not exact estimate". My understanding is that this is actually 
only true when doing distributed faceting, and that it is exact in the 
non-distributed case.

{{UniqueAgg}} calculates {{sumUnique}}, but does not appear to actually use it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

