The rollup streaming expression rolls up aggregations on a stream that has been sorted by the group-by fields. This is essentially a MapReduce reduce operation, and it can work with extremely high cardinality (effectively unlimited). The rollup function is designed to roll up data produced by the /export handler, which can also sort data sets with very high cardinality. The docs should describe the correct usage of the rollup expression with the /export handler.
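A minimal sketch of that pairing, using hypothetical collection and field names (`logs`, `day_s`):

```
rollup(
  search(logs,
         q="*:*",
         qt="/export",
         fl="day_s",
         sort="day_s asc"),
  over="day_s",
  count(*))
```

The key requirement is that the sort of the underlying stream matches the `over` fields. Because /export streams the entire result set in sorted order, rollup only ever holds the current group in memory, which is why the cardinality is effectively unbounded.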
Joel Bernstein
http://joelsolr.blogspot.com/

On Tue, Feb 20, 2018 at 11:10 AM, Shawn Heisey <apa...@elyograg.org> wrote:
> On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>> We have a query that we can resolve using either facet or search with rollup. In the Stream Source Reference section of Solr’s Reference Guide (https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet) it says “To support high cardinality aggregations see the rollup function”. I was wondering what is considered “high cardinality”. If it helps, our query returns up to 60k results. I haven’t gotten around to any benchmarking to see if there’s any difference, though, because facet so far performs very well, but I don’t know if I’m near the “tipping point”. Any feedback would be appreciated.
>
> There's no hard-and-fast rule for this. The tipping point is going to be different for every use case. With a little bit of information about your setup, experienced users can make an educated guess about whether or not performance will be good, but cannot say with absolute certainty what you're going to run into.
>
> Let's start with some definitions, which you may or may not already know:
>
> https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
> https://en.wikipedia.org/wiki/Cardinality
>
> You haven't said how many unique values are in your field. The only information I have from you is 60K results from your queries, which may or may not have any bearing on the total number of documents in your index, or the total number of unique values in the field you're using for faceting. So the next paragraph may or may not apply to your index.
>
> In general, 60,000 unique values in a field would be considered very low cardinality, because computers can typically operate on 60,000 values *very* quickly, unless the size of each value is enormous.
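For comparison, the facet() source the question refers to would express the same aggregation roughly as follows (same hypothetical collection and field names as above; the bucket limit is illustrative):

```
facet(logs,
      q="*:*",
      buckets="day_s",
      bucketSorts="count(*) desc",
      bucketSizeLimit=100,
      count(*))
```

facet() pushes the aggregation into Solr's JSON Facet API and materializes the buckets in memory on the server, which is fast at low cardinality but is why the docs point high-cardinality aggregations toward rollup instead.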
> But if the index has 60,000 total documents, then *in relation to other data*, the cardinality is very high, even though most people would say the opposite. Sixty thousand documents or unique values is almost always a very small index, not prone to performance issues.
>
> The warnings about cardinality in the Solr documentation mostly refer to *absolute* cardinality -- how many unique values there are in a field, regardless of the actual number of documents. If there are millions or billions of unique values, then operations like facets, grouping, sorting, etc. are probably going to be slow. If there are far fewer, such as thousands or only a handful, then those operations are likely to be very fast, because the computer will have less information to process.
>
> Thanks,
> Shawn