Thank you for all the information. As usual, I learn a lot about Solr and data 
modeling with each reply.

Some more details about my collection:
- Approximately 200M documents
- 1.2M different values in the field I’m faceting over

The query I’m running is over a single bucket; after applying q and fq, the 
1.2M values are reduced to at most 60K (often around half that). From your 
replies I assume I’m not going to hit a bottleneck any time soon. Thanks a 
lot.
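
For reference, the facet expression I’m using looks roughly like this (the 
collection, field, and filter names below are placeholders for my actual 
setup, so treat it as a sketch rather than the exact query):

    facet(my_collection,
          q="*:*",
          fq="bucket_id:B1",
          buckets="my_facet_field",
          bucketSorts="count(*) desc",
          bucketSizeLimit=60000,
          count(*))

The fq is what narrows the 1.2M distinct values down to the ~60K buckets I 
mentioned, before the aggregation runs.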

> On 20 Feb 2018, at 18:03, Joel Bernstein <joels...@gmail.com> wrote:
> 
> The rollup streaming expression rolls up aggregations on a stream that has
> been sorted by the group by fields. This is basically a MapReduce reduce
> operation and can work with extremely high cardinality (basically
> unlimited). The rollup function is designed to rollup data produced by the
> /export handler which can also sort data sets with very high cardinality.
> The docs should describe the correct usage of the rollup expression with
> the /export handler.
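
(For anyone reading this in the archives: as I understand Joel’s description, 
a minimal sketch of rollup over the /export handler, with placeholder 
collection and field names, would be

    rollup(
      search(my_collection,
             q="*:*",
             qt="/export",
             fl="group_field,metric_field",
             sort="group_field asc"),
      over="group_field",
      count(*),
      sum(metric_field))

where the sort on the inner search must match the over fields, since rollup 
reduces a stream that is already sorted by them.)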
> 
> Joel Bernstein
> http://joelsolr.blogspot.com/
> 
> On Tue, Feb 20, 2018 at 11:10 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> 
>> On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>> 
>>> We have a query that we can resolve using either facet or search with
>>> rollup. In the Stream Source Reference section of Solr’s Reference Guide (
>>> https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet)
>>> it says “To support high cardinality aggregations see the rollup
>>> function”. I was wondering what is considered “high cardinality”. If it
>>> helps, our query returns up to 60K results. I haven’t gotten around to any
>>> benchmarking to see if there’s a difference, though, because facet has
>>> performed very well so far, but I don’t know if I’m near the “tipping
>>> point”. Any feedback would be appreciated.
>>> 
>> 
>> There's no hard and fast rule for this.  The tipping point is going to be
>> different for every use case.  With a little bit of information about your
>> setup, experienced users can make an educated guess about whether or not
>> performance will be good, but cannot say with absolute certainty what
>> you're going to run into.
>> 
>> Let's start with some definitions, which you may or may not already know:
>> 
>> https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
>> https://en.wikipedia.org/wiki/Cardinality
>> 
>> You haven't said how many unique values are in your field.  The only
>> information I have from you is 60K results from your queries, which may or
>> may not have any bearing on the total number of documents in your index, or
>> the total number of unique values in the field you're using for faceting.
>> So the next paragraph may or may not apply to your index.
>> 
>> In general, 60,000 unique values in a field would be considered very low
>> cardinality, because computers can typically operate on 60,000 values
>> *very* quickly, unless the size of each value is enormous.  But if the
>> index has 60,000 total documents, then *in relation to other data*, the
>> cardinality is very high, even though most people would say the opposite.
>> Sixty thousand documents or unique values is almost always a very small
>> index, not prone to performance issues.
>> 
>> The warnings about cardinality in the Solr documentation mostly refer to
>> *absolute* cardinality -- how many unique values there are in a field,
>> regardless of the actual number of documents.  If there are millions or
>> billions of unique values, then operations like facets, grouping, sorting,
>> etc. are probably going to be slow.  If there are far fewer, such as
>> thousands or only a handful, then those operations are likely to be very
>> fast, because the computer has less information to process.
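
(A quick aside for anyone reading along: one way to measure a field’s 
absolute cardinality is the unique aggregation in the JSON Facet API; the 
collection and field names here are placeholders:

    curl http://localhost:8983/solr/my_collection/query -d '
    {
      "query": "*:*",
      "limit": 0,
      "facet": {
        "distinct_values": "unique(my_facet_field)"
      }
    }'

unique counts the distinct values in the field, though it can be approximate 
on a multi-shard collection; hll(my_facet_field) is the HyperLogLog-based 
alternative that stays cheap at very high cardinality.)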
>> 
>> Thanks,
>> Shawn
>> 
>> 

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel: +44 (0) 1223 49 2633
Skype: amunozpomer
