Hi Toke,

Thank you for the reply!
Both the single-valued field with the semicolon tokenizer and the multi-valued untokenized field have static warming queries in place. In fact, that was the first thing I did to improve performance. Below are my warming queries in solrconfig.xml.

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <!-- begin: static warming for facets -->
          <str name="facet.field">au_facet</str>
          <str name="facet.field">per_facet</str>
          <str name="facet.field">org_facet</str>
          <str name="facet.field">dt</str>
          <str name="facet.field">brd</str>
          <str name="facet.pivot">industry,source_facet</str>
          <str name="facet.pivot">availability,availability_status</str>
          <str name="qt">search</str>
          <str name="facet">true</str>
          <str name="f.au_facet.facet.limit">5</str>
          <str name="f.per_facet.facet.limit">5</str>
          <str name="f.org_facet.facet.limit">5</str>
          <str name="f.dg.facet.limit">5</str>
          <str name="f.dt.facet.limit">5</str>
        </lst>
        <!-- end: static warming for facets -->
      </arr>
    </listener>
    <listener event="firstSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <!-- begin: static warming for facets -->
          <str name="facet.field">au_facet</str>
          <str name="facet.field">per_facet</str>
          <str name="facet.field">org_facet</str>
          <str name="facet.field">dt</str>
          <str name="facet.field">brd</str>
          <str name="facet.pivot">industry,source_facet</str>
          <str name="facet.pivot">availability,availability_status</str>
          <str name="qt">search</str>
          <str name="facet">true</str>
          <str name="f.au_facet.facet.limit">5</str>
          <str name="f.per_facet.facet.limit">5</str>
          <str name="f.org_facet.facet.limit">5</str>
          <str name="f.dg.facet.limit">5</str>
          <str name="f.dt.facet.limit">5</str>
        </lst>
        <!-- end: static warming for facets -->
      </arr>
    </listener>

As for cardinality, for example, the per_facet field (person facet) has 4,627,056 unique terms for 14,000,000 documents. Maybe my warming queries are not correct?
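A few things in the warming entries may be worth checking: neither one sets `q`, `f.dg.facet.limit` refers to a field (`dg`) that is not listed as a `facet.field`, and `brd` has no limit of its own. Whether the missing `q` matters depends on the defaults configured for the `search` handler, which are not shown here. Below is a rough sketch of one warming entry, not a verified fix. It assumes a match-all query is acceptable for warming and that the `dg` limit was meant for `brd`; both are guesses rather than facts about the actual setup.

```xml
<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <lst>
      <!-- assumption: match-all query so the facets are computed over the whole index -->
      <str name="q">*:*</str>
      <!-- assumption: no documents need to be returned while warming -->
      <str name="rows">0</str>
      <str name="qt">search</str>
      <str name="facet">true</str>
      <str name="facet.field">au_facet</str>
      <str name="facet.field">per_facet</str>
      <str name="facet.field">org_facet</str>
      <str name="facet.field">dt</str>
      <str name="facet.field">brd</str>
      <str name="facet.pivot">industry,source_facet</str>
      <str name="facet.pivot">availability,availability_status</str>
      <str name="f.au_facet.facet.limit">5</str>
      <str name="f.per_facet.facet.limit">5</str>
      <str name="f.org_facet.facet.limit">5</str>
      <str name="f.dt.facet.limit">5</str>
      <!-- assumption: the original f.dg.facet.limit was intended for brd -->
      <str name="f.brd.facet.limit">5</str>
    </lst>
  </arr>
</listener>
```

The same `<lst>` would go under the `firstSearcher` listener as well, since the two listeners in the original config are identical.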
I just don't get why the multi-valued untokenized field yielded such a performance improvement. I guess it doesn't make sense to you either :) I will definitely give docValues a try to see if it further improves the performance.

Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu

On 6/13/14 1:24 PM, "Toke Eskildsen" <t...@statsbiblioteket.dk> wrote:

>Tang, Rebecca [rebecca.t...@ucsf.edu] wrote:
>> I have a Solr index with 14+ million records. We facet on quite a few fields with very
>> high-cardinality such as author, person, organization, brand and document type. Some
>> of the records contain thousands of persons and organizations. So the person and
>> organization fields can be very large.
>
>How many unique values per field in the full index are we talking? Just
>approximately.
>
>> After this change, the performance improved drastically. But I can't understand why
>> building these fields as multi-valued field vs. single-valued field with semicolon
>> tokenizer can have such a dramatic performance difference.
>
>It should not. I suspect something else is happening. 10 minutes does not
>sound unrealistic if it is your first query after an index update. Maybe
>your measurement for tokenized was unwarmed and your measurement for
>un-tokenized warmed? Could you give an example of a full query?
>
>Anyway, you should definitely be using DocValues for such high
>cardinality facet-fields.
>
>Depending on your usage pattern and where the bottleneck is,
>https://issues.apache.org/jira/browse/SOLR-5894 might also help.
>
>- Toke Eskildsen
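For the DocValues suggestion quoted above, a minimal schema.xml sketch of what the change might look like for the person facet. The field type name `string`, the `solr.StrField` class, and the `indexed`/`stored` settings are assumptions, since the actual schema is not shown in this thread. Only `per_facet`, its multi-valued nature, and the use of docValues come from the discussion.

```xml
<!-- assumption: the facet field is backed by an untokenized string type -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- docValues="true" stores the facet values in a column-oriented structure
     at index time; indexed/stored values here are illustrative only -->
<field name="per_facet" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```

Turning on `docValues` for an existing field requires reindexing the documents, so the comparison against the current timings would have to be made on a freshly built index.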