Thank you for all the information. As usual, I learn a lot about Solr and data modeling with each reply.
Some more details about my collection:

- Approximately 200M documents
- 1.2M distinct values in the field I'm faceting over

The query I'm running is over a single bucket; after applying q and fq, the 1.2M values are reduced to at most 60K (often around half that). From your replies I assume I'm not going to hit a bottleneck any time soon. Thanks a lot.
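For concreteness, the two alternatives discussed below would look roughly like this in our case (a minimal sketch; the collection name, field name, and filter query are illustrative placeholders, not our actual schema):

    facet(myCollection,
          q="*:*",
          fq="bucket:b1",
          buckets="myField",
          bucketSorts="count(*) desc",
          bucketSizeLimit=60000,
          count(*))

    rollup(search(myCollection,
                  q="*:*",
                  fq="bucket:b1",
                  fl="myField",
                  sort="myField asc",
                  qt="/export"),
           over="myField",
           count(*))

As I understand it, the first pushes the aggregation down to each shard via the JSON Facet API, while the second streams the full, sorted result set out of the /export handler (note the sort on the over field) and reduces it MapReduce-style, which is why it tolerates essentially unlimited cardinality.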
> On 20 Feb 2018, at 18:03, Joel Bernstein <joels...@gmail.com> wrote:
>
> The rollup streaming expression rolls up aggregations on a stream that
> has been sorted by the group-by fields. This is basically a MapReduce
> reduce operation and can work with extremely high cardinality (basically
> unlimited). The rollup function is designed to roll up data produced by
> the /export handler, which can also sort data sets with very high
> cardinality. The docs should describe the correct usage of the rollup
> expression with the /export handler.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
> On Tue, Feb 20, 2018 at 11:10 AM, Shawn Heisey <apa...@elyograg.org> wrote:
>
>> On 2/20/2018 4:44 AM, Alfonso Muñoz-Pomer Fuentes wrote:
>>
>>> We have a query that we can resolve using either facet or search with
>>> rollup. In the Stream Source Reference section of Solr's Reference Guide
>>> (https://lucene.apache.org/solr/guide/7_1/stream-source-reference.html#facet)
>>> it says "To support high cardinality aggregations see the rollup
>>> function". I was wondering what is considered "high cardinality". If it
>>> helps, our query returns up to 60K results. I haven't had a chance to do
>>> any benchmarking to see if there's any difference, though, because facet
>>> has performed very well so far, but I don't know if I'm near the
>>> "tipping point". Any feedback would be appreciated.
>>
>> There's no hard-and-fast rule for this. The tipping point is going to be
>> different for every use case. With a little information about your
>> setup, experienced users can make an educated guess about whether or not
>> performance will be good, but cannot say with absolute certainty what
>> you're going to run into.
>>
>> Let's start with some definitions, which you may or may not already know:
>>
>> https://en.wikipedia.org/wiki/Cardinality_(data_modeling)
>> https://en.wikipedia.org/wiki/Cardinality
>>
>> You haven't said how many unique values are in your field. The only
>> information I have from you is 60K results from your queries, which may
>> or may not have any bearing on the total number of documents in your
>> index, or the total number of unique values in the field you're using
>> for faceting. So the next paragraph may or may not apply to your index.
>>
>> In general, 60,000 unique values in a field would be considered very low
>> cardinality, because computers can typically operate on 60,000 values
>> *very* quickly, unless the size of each value is enormous. But if the
>> index has 60,000 total documents, then *in relation to other data* the
>> cardinality is very high, even though most people would say the
>> opposite. Sixty thousand documents or unique values is almost always a
>> very small index, not prone to performance issues.
>>
>> The warnings about cardinality in the Solr documentation mostly refer to
>> *absolute* cardinality -- how many unique values there are in a field,
>> regardless of the actual number of documents. If there are millions or
>> billions of unique values, then operations like facets, grouping,
>> sorting, etc. are probably going to be slow. If there are far fewer,
>> such as thousands or only a handful, then those operations are likely to
>> be very fast, because the computer has less information to process.
>>
>> Thanks,
>> Shawn

--
Alfonso Muñoz-Pomer Fuentes
Senior Lead Software Engineer @ Expression Atlas Team
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Tel: +44 (0) 1223 49 2633
Skype: amunozpomer