queryResultCache doesn’t really help with faceting, even if it’s hit for the main query. That cache only stores a subset of the hits, and to facet properly you need the entire result set….
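For reference, the bit of solrconfig.xml in play here looks roughly like the following (cache class and sizes are purely illustrative, not a recommendation). The queryResultCache stores an ordered window of document ids per query/sort, up to queryResultWindowSize at a time, rather than the full set of matching documents, which is why it can't be reused to drive faceting:

    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
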
> On Jun 17, 2020, at 12:47 PM, James Bodkin <james.bod...@loveholidays.com> wrote:
>
> We've noticed that the filterCache uses a significant amount of memory, as we've assigned 8GB Heap per instance.
> In total, we have 32 shards with 2 replicas, hence (8 * 32 * 2) 512G of heap space alone; further memory is required to ensure the index is always memory mapped for performance reasons.
>
> Ideally I would like to be able to reduce the amount of memory assigned to the heap by using docValues instead of indexed, but it doesn't seem possible.
> The QTime (after warming) for facet.method=enum is around 150-250ms, whereas the QTime for facet.method=fc is around 1000-1200ms.
> As we require the results in real-time for customers searching on our website, the latter QTime of 1000-1200ms is too slow for us to be able to use.
>
> Our facet queries change as the customer selects different search criteria, and hence the sheer number of potential queries makes it very difficult for the query result cache to help.
> We already have a custom implementation in which we check our redis cache for queries before they are sent to our aggregators, which runs at a 30% hit rate.
>
> Kind Regards,
>
> James Bodkin
>
> On 17/06/2020, 16:21, "Michael Gibney" <mich...@michaelgibney.net> wrote:
>
> To expand a bit on what Erick said regarding performance: my sense is that the RefGuide assertion that "docValues=true" makes faceting "faster" could use some qualification/clarification. My take, fwiw:
>
> First, to reiterate/paraphrase what Erick said: the "faster" assertion is not comparing to "facet.method=enum". For low-cardinality fields, if you have the heap space, and are very intentional about configuring your filterCache (and monitoring it as access patterns might change), "facet.method=enum" will likely be as fast as you can get (at least for "legacy" facets or whatever -- not sure about the "enum" method in JSON facets).
>
> Even where "docValues=true" arguably does make faceting "faster", the main benefit is that the "uninverted" data structures are serialized on disk, so you're avoiding the need to uninvert each facet field on-heap for every new indexSearcher, which is generally high-latency -- user perception of this latency can be mitigated using warming queries, but it can still be problematic, esp. for frequent index updates. On-heap uninversion also inherently consumes a lot of heap space, which has general implications wrt GC, etc ... so in that respect, even if faceting per se might not be "faster" with "docValues=true", your overall system may in many cases perform better.
>
> (And Anthony, I'm pretty sure that tag/ex on facets should be orthogonal to the "facet.method=enum"/filterCache discussion, as tag/ex only affects the DocSet domain over which facets are calculated ... I think that step is pretty cleanly separated from the actual calculation of the facets. I'm not 100% sure on that, so proceed with caution, but it could definitely be worth evaluating for your use case!)
>
> Michael
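Concretely, the two set-ups being weighed in this thread look roughly like this in the schema (the field name is borrowed from the queries further down; stored/multiValued and the rest depend on your actual schema):

    <!-- docValues route: facet data is serialized on disk, no on-heap uninversion needed -->
    <field name="D_Destination" type="string" indexed="true" stored="false" docValues="true" uninvertible="false"/>

    <!-- enum/filterCache route: facet by running one filter query per indexed term -->
    <field name="D_Destination" type="string" indexed="true" stored="false" docValues="false"/>
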
> On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Uninvertible is a safety mechanism to make sure that you don't _unknowingly_ use a docValues=false field for faceting/grouping/sorting/function queries. The primary point of docValues=true is twofold:
>>
>> 1> reduce Java heap requirements by letting the OS memory hold the data
>>
>> 2> uninverting can be expensive CPU wise too, although not with just a few unique values (for each term, read the list of docs that have it and flip a bit).
>>
>> It doesn't really make sense to set it on an index=false field, since uninverting only happens on index=true docValues=false. OTOH, I don't think it would do any harm either. That said, I frankly don't know how that interacts with facet.method=enum.
>>
>> As far as speed… yeah, you're in the edge cases. All things being equal, stuffing these into the filterCache is the fastest way to facet if you have the memory. I've seen very few installations where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 + some overhead bytes. If maxDoc is very large, this'll chew up an enormous amount of memory. I'm cheating a bit here, since the size might be smaller if only a few docs match a particular entry. But that's the worst case you have to allow for, 'cause you could theoretically hit the perfect storm where, due to some particular sequence of queries, your entire filter cache fills up with entries that size.
>>
>> You'll have some overhead to keep the cache at that size, but it sounds like it's worth it.
>>
>> Best,
>> Erick
>>
>>> On Jun 17, 2020, at 10:05 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
>>>
>>> The large majority of the relevant fields have fewer than 20 unique values. We have two fields over that, with 150 unique values and 5300 unique values respectively.
>>> At the moment, our filterCache is configured with a maximum size of 8192.
>>>
>>> The DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that this approach promises to make lookups for faceting, sorting and grouping much faster.
>>> Hence I thought that using DocValues would be better than using Indexed, and in turn improve our response times and possibly lower memory requirements. It sounds like this isn't the case if you are able to allocate enough memory to the filterCache.
>>>
>>> I haven't yet tried changing the uninvertible setting; I was looking at the documentation for this field earlier today.
>>> Should we be setting uninvertible="false" if docValues="true", regardless of whether indexed is true or false?
>>>
>>> Kind Regards,
>>>
>>> James Bodkin
>>>
>>> On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:
>>>
>>> facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works.
>>> I was going to suggest that you might not want to set indexed=false on the docValues facet fields anyway, since the indexed values are still used for facet refinement (assuming your index is distributed).
>>>
>>> What's the number of unique values in the relevant fields? If it's low enough, setting docValues=false and indexed=true and using facet.method=enum (with a sufficiently large filterCache) is definitely a viable option, and will almost certainly be faster than docValues-based faceting.
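To put rough numbers on the maxDoc/8 estimate above (the maxDoc figure here is purely illustrative, substitute your own):

    maxDoc = 20,000,000  ->  20,000,000 / 8 = 2,500,000 bytes, i.e. roughly 2.5MB per filterCache entry, worst case
    8192 entries (the maximum size mentioned above) x ~2.5MB = roughly 20GB of heap, worst case

so the filterCache maximum size and the per-instance heap do have to be budgeted together if you go the facet.method=enum route.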
>>> (As an aside, noting for future reference: high-cardinality facets over high-cardinality DocSet domains might be able to benefit from a term facet count cache: https://issues.apache.org/jira/browse/SOLR-13807)
>>>
>>> I think you didn't specifically mention whether you acted on Erick's suggestion of setting "uninvertible=false" (I think Erick accidentally said "uninvertible=true") to fail fast. I'd also recommend doing that, perhaps even above all else -- it shouldn't actually *do* anything, but will help ensure that things are behaving as you expect them to!
>>>
>>> Michael
>>>
>>> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>
>>>> Thanks, I've implemented some queries that improve the first-hit execution for faceting.
>>>>
>>>> Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used.
>>>> Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these differences exist?
>>>>
>>>> On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>
>>>> Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields.
>>>>
>>>> The *:* query without facets or sorting does virtually nothing due to some special handling...
>>>>
>>>> On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>>
>>>>> I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail, but I'm still seeing the first-hit execution.
>>>>> Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters, or a *:* query with each of the fields and values as part of the fq query parameters.
>>>>>
>>>>> At the moment I've been running these manually, as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher.
>>>>> Unfortunately we can't use the autowarm count that is available as part of the queryResultCache/filterCache due to the custom deployment mechanism we use to update our index.
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> James Bodkin
>>>>>
>>>>> On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>>
>>>>> Did you try the autowarming like I mentioned in my previous e-mail?
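(The autowarming being referred to here is the autowarmCount attribute on the caches in solrconfig.xml, along these lines; the count is just a placeholder, and queryResultCache takes the same attribute:

    <filterCache class="solr.LRUCache" size="8192" initialSize="512" autowarmCount="16"/>

When a new searcher is opened, Solr replays that many of the most recently used cache keys against it.)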
>>>>>> On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>>>
>>>>>> We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed, as these fields are used for faceting and filtering only.
>>>>>>
>>>>>> Since those changes, we've found that the first-execution for queries is really noticeable. I thought this would be the filterCache, based on what I saw in NewRelic, however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?
>>>>>>
>>>>>> For example, I've run the following queries in sequence and each query has a first-execution penalty.
>>>>>>
>>>>>> Query 1:
>>>>>>
>>>>>> q=*:*
>>>>>> facet=true
>>>>>> facet.field=D_DepartureAirport
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 2:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2660)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 3:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2661)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 4:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2660+OR+2661)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> We've kept the field type as a string, as the value is mapped by the application that accesses Solr. In the examples above, the values are mapped to airports and destinations.
>>>>>> Is it possible to prewarm the above queries without having to define all the potential filters manually in the autowarming?
>>>>>>
>>>>>> At the moment, we update and optimise our index in a different environment and then copy the index to our production instances by using a rolling deployment in Kubernetes.
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> James Bodkin
>>>>>>
>>>>>> On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>> I question whether the filterCache has anything to do with it; I suspect what's really happening is that the first time, you're reading the relevant bits from disk into memory. And to double-check, you should have docValues enabled for all these fields. The "uninverting" process can be very expensive, and docValues bypasses that.
>>>>>>
>>>>>> As of Solr 7.6, you can define "uninvertible=true" on your field(Type) to "fail fast" if Solr needs to uninvert the field.
>>>>>>
>>>>>> But that's an aside. In either case, my claim is that first-time execution does "something": either reads the serialized docValues from disk or uninverts the field on Solr's heap.
>>>>>>
>>>>>> You can have this autowarmed by any combination of:
>>>>>>
>>>>>> 1> specifying an autowarm count on your queryResultCache. That's hit or miss, as it replays the most recent N queries, which may or may not contain the sorts. That said, specifying 10-20 for the autowarm count is usually a good idea, assuming you're not committing more often than, say, every 30 seconds. I'd add the same to the filterCache too.
>>>>>>
>>>>>> 2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The difference is that newSearcher is fired every time a commit happens, while firstSearcher is only fired when Solr starts, the theory being that there's no cache autowarming available when Solr first powers up. Usually, people don't bother with firstSearcher or just make it the same as newSearcher. Note that a query doesn't have to be "real" at all. You can just add all the facet fields to a *:* query in a single go.
>>>>>>
>>>>>> BTW, Trie fields will stay around for a long time even though deprecated. Or at least until we find something to replace them with that doesn't have this penalty, so I'd feel pretty safe using those, and they'll be more efficient than strings.
>>>>>>
>>>>>> Best,
>>>>>> Erick
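A minimal sketch of option 2> above, using the two facet fields from this thread (D_DepartureAirport and D_Destination); the listener syntax is stock solrconfig.xml, while the query list itself is just an illustration to trim or extend as needed:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <!-- a single *:* query that facets on all the fields we care about, so the
               docValues (or uninverted) structures are read before user queries arrive -->
          <str name="q">*:*</str>
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.field">D_DepartureAirport</str>
          <str name="facet.field">D_Destination</str>
          <str name="facet.limit">-1</str>
        </lst>
      </arr>
    </listener>
    <!-- an identical block with event="firstSearcher" covers the cold-start case -->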