Hi Toke,

Thank you for the reply!

Both single-value-with-semi-colon-tokenizer and multi-value-untokenized
have static warming queries in place.  In fact, that was the first thing I
did to improve performance.

Below are my warming queries in solrconfig.xml.

<listener event="newSearcher" class="solr.QuerySenderListener">
            <arr name="queries">
                <lst> <!-- begin: static warming for facets -->
                    <str name="facet.field">au_facet</str>
                    <str name="facet.field">per_facet</str>
                    <str name="facet.field">org_facet</str>
                    <str name="facet.field">dt</str>
                    <str name="facet.field">brd</str>
                    <str name="facet.pivot">industry,source_facet</str>
                    <str name="facet.pivot">availability,availability_status</str>
                    <str name="qt">search</str>
                    <str name="facet">true</str>
                    <str name="f.au_facet.facet.limit">5</str>
                    <str name="f.per_facet.facet.limit">5</str>
                    <str name="f.org_facet.facet.limit">5</str>
                    <str name="f.dg.facet.limit">5</str>
                    <str name="f.dt.facet.limit">5</str>
                </lst> <!-- end: static warming for facets -->
            </arr>
        </listener>
        <listener event="firstSearcher" class="solr.QuerySenderListener">
            <arr name="queries">
                <lst> <!-- begin: static warming for facets -->
                    <str name="facet.field">au_facet</str>
                    <str name="facet.field">per_facet</str>
                    <str name="facet.field">org_facet</str>
                    <str name="facet.field">dt</str>
                    <str name="facet.field">brd</str>
                    <str name="facet.pivot">industry,source_facet</str>
                    <str name="facet.pivot">availability,availability_status</str>
                    <str name="qt">search</str>
                    <str name="facet">true</str>
                    <str name="f.au_facet.facet.limit">5</str>
                    <str name="f.per_facet.facet.limit">5</str>
                    <str name="f.org_facet.facet.limit">5</str>
                    <str name="f.dg.facet.limit">5</str>
                    <str name="f.dt.facet.limit">5</str>
                </lst> <!-- end: static warming for facets -->
            </arr>
        </listener>


As for cardinality, for example, the per_facet field (person facet) has
4,627,056 unique terms for 14,000,000 documents.

Maybe my warming queries are not correct?  I just don't get why the
multi-valued untokenized field yielded such a performance improvement. I
guess it doesn't make sense to you either :)

I will definitely give docValues a try to see if it further improves
the performance.
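
For reference, here is a rough sketch of what I expect the schema.xml change
to look like for one of the facet fields.  The type name and the indexed/stored
values below are assumptions for illustration, not copied from my actual
schema; the docValues attribute itself is the only part that matters here.

```xml
<!-- Hypothetical schema.xml entry: type, indexed and stored are assumed values -->
<!-- docValues="true" stores the field's values in a column-oriented structure
     that is built at index time, for use by faceting and sorting -->
<field name="per_facet" type="string" indexed="true" stored="false"
       multiValued="true" docValues="true"/>
```

As far as I understand, enabling docValues on an existing field requires a
full reindex before it takes effect.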


Rebecca Tang
Applications Developer, UCSF CKM
Legacy Tobacco Document Library <legacy.library.ucsf.edu/>
E: rebecca.t...@ucsf.edu




On 6/13/14 1:24 PM, "Toke Eskildsen" <t...@statsbiblioteket.dk> wrote:

>Tang, Rebecca [rebecca.t...@ucsf.edu] wrote:
>> I have a Solr index with 14+ million records.  We facet on quite a few
>> fields with very high cardinality, such as author, person, organization,
>> brand and document type.  Some of the records contain thousands of
>> persons and organizations.  So the person and organization fields can
>> be very large.
>
>How many unique values per field in the full index are we talking? Just
>approximately.
>
>> After this change, the performance improved drastically. But I can't
>> understand why building these fields as multi-valued field vs.
>> single-valued field with semicolon tokenizer can have such a dramatic
>> performance difference.
>
>It should not. I suspect something else is happening. 10 minutes does not
>sound unrealistic if it is your first query after an index update. Maybe
>your measurement for tokenized was unwarmed and your measurement for
>un-tokenized warmed? Could you give an example of a full query?
>
>Anyway, you should definitely be using DocValues for such high
>cardinality facet-fields.
>
>Depending on your usage pattern and where the bottleneck is,
>https://issues.apache.org/jira/browse/SOLR-5894 might also help.
>
>- Toke Eskildsen

