To expand a bit on what Erick said regarding performance: my sense is
that the RefGuide assertion that "docValues=true" makes faceting
"faster" could use some qualification/clarification. My take, fwiw:

First, to reiterate/paraphrase what Erick said: the "faster" assertion
is not comparing to "facet.method=enum". For low-cardinality fields,
if you have the heap space, and are very intentional about configuring
your filterCache (and monitoring it as access patterns might change),
"facet.method=enum" will likely be as fast as you can get (at least
for "legacy" facets or whatever -- not sure about "enum" method in
JSON facets).
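
As a mental model only (a deliberately naive sketch, not Solr's actual
implementation), the enum method effectively intersects each term's doc set
with the query's doc set, which is why the filterCache matters so much here:

```python
# Illustrative-only model of facet.method=enum: count, for each indexed
# term, how many of its documents fall inside the query's doc set.
# In Solr, the per-term doc sets are what land in the filterCache.

def enum_facet(postings: dict, query_docs: set) -> dict:
    """postings maps term -> set of matching doc ids (hypothetical data)."""
    return {term: len(docs & query_docs) for term, docs in postings.items()}

postings = {"LON": {1, 2, 3}, "PAR": {2, 4}}  # hypothetical airport codes
print(enum_facet(postings, {1, 2, 4}))  # -> {'LON': 2, 'PAR': 2}
```

The cost scales with the number of unique terms in the field, hence the
low-cardinality caveat above.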

Even where "docValues=true" arguably does make faceting "faster", the
main benefit is that the "uninverted" data structures are serialized
on disk, so you're avoiding the need to uninvert each facet field
on-heap for every new indexSearcher, which is generally high-latency
-- user perception of this latency can be mitigated using warming
queries, but it can still be problematic, esp. for frequent index
updates. On-heap uninversion also inherently consumes a lot of heap
space, which has general implications wrt GC, etc ... so in that
respect even if faceting per se might not be "faster" with
"docValues=true", your overall system may in many cases perform
better.
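
To put rough (purely illustrative) numbers on the heap trade-off, using the
worst-case filterCache estimate Erick gives below (maxDoc/8 bytes per entry):

```python
# Back-of-the-envelope worst case for filterCache heap use: each entry can
# be a bitset of one bit per document (maxDoc/8 bytes), times the cache's
# maximum size. Figures are illustrative, not a recommendation.

def filter_cache_worst_case_bytes(max_doc: int, cache_size: int) -> int:
    bitset_bytes = max_doc // 8  # one bit per doc in the index
    return bitset_bytes * cache_size

# e.g. 50M docs and the filterCache size of 8192 mentioned in this thread:
gib = filter_cache_worst_case_bytes(50_000_000, 8192) / (1024 ** 3)
print(round(gib, 1))  # -> 47.7
```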

(and Anthony, I'm pretty sure that tag/ex on facets should be
orthogonal to the "facet.method=enum"/filterCache discussion, as
tag/ex only affects the DocSet domain over which facets are calculated
... I think that step is pretty cleanly separated from the actual
calculation of the facets. I'm not 100% sure on that, so proceed with
caution, but it could definitely be worth evaluating for your use
case!)
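
For concreteness, the tag/ex syntax in question looks roughly like this
(tag name is arbitrary; the field comes from the examples earlier in the
thread):

    fq={!tag=dep}D_DepartureAirport:2660
    facet.field={!ex=dep}D_DepartureAirport

The {!ex=dep} exclusion only changes the DocSet domain the counts are
computed over; the counting itself still uses whatever facet.method is in
effect.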

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ 
> use a docValues=false
> field for faceting/grouping/sorting/function queries. The primary point of 
> docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU-wise too, although not with just a few
>     unique values (for each term, read the list of docs that have it and flip 
> a bit).
>
> It doesn’t really make sense to set it on an index=false field, since 
> uninverting only happens on
> index=true docValues=false. OTOH, I don’t think it would do any harm either. 
> That said, I frankly
> don’t know how that interacts with facet.method=enum.
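>
> For concreteness (a sketch; the field name is hypothetical), a facet-only
> field with fail-fast uninversion checking would be declared like:
>
>     <field name="destination" type="string" indexed="true" stored="false"
>            docValues="true" uninvertible="false"/>
>
> With uninvertible="false", a component that would otherwise silently
> uninvert the field on the heap errors out instead.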
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, 
> stuffing these into the
> filterCache is the fastest way to facet if you have the memory. I’ve seen 
> very few installations
> where people have that luxury though. Each entry in the filterCache can 
> occupy maxDoc/8 + some overhead
> bytes. If maxDoc is very large, this’ll chew up an enormous amount of memory. 
> I'm cheating
> a bit here: the size can be smaller if only a few docs match any
> particular entry.
> But that's the worst-case you have to allow for 'cause you 
> could theoretically hit
> the perfect storm where, due to some particular sequence of queries, your 
> entire filter
> cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like 
> it’s worth it.
>
> Best,
> Erick
>
>
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin <james.bod...@loveholidays.com> 
> > wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. 
> > We have two fields over that with 150 unique values and 5300 unique values 
> > respectively.
> > At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > From the DocValues documentation 
> > (https://lucene.apache.org/solr/guide/8_3/docvalues.html), it mentions that 
> > this approach promises to make lookups for faceting, sorting and grouping 
> > much faster.
> > Hence I thought that using DocValues would be better than using Indexed and 
> > in turn improve our response times and possibly lower memory requirements. 
> > It sounds like this isn't the case if you are able to allocate enough 
> > memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting, I was looking at the 
> > documentation for this field earlier today.
> > Should we be setting uninvertible="false" if docValues="true" regardless of 
> > whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:
> >
> >    facet.method=enum works by executing a query (against indexed values)
> >    for each indexed value in a given field (which, for indexed=false, is
> >    "no values"). So that explains why facet.method=enum no longer works.
> >    I was going to suggest that you might not want to set indexed=false on
> >    the docValues facet fields anyway, since the indexed values are still
> >    used for facet refinement (assuming your index is distributed).
> >
> >    What's the number of unique values in the relevant fields? If it's low
> >    enough, setting docValues=false and indexed=true and using
> >    facet.method=enum (with a sufficiently large filterCache) is
> >    definitely a viable option, and will almost certainly be faster than
> >    docValues-based faceting. (As an aside, noting for future reference:
> >    high-cardinality facets over high-cardinality DocSet domains might be
> >    able to benefit from a term facet count cache:
> >    https://issues.apache.org/jira/browse/SOLR-13807)
> >
> >    I think you didn't specifically mention whether you acted on Erick's
> >    suggestion of setting "uninvertible=false" (I think Erick accidentally
> >    said "uninvertible=true") to fail fast. I'd also recommend doing that,
> >    perhaps even above all else -- it shouldn't actually *do* anything,
> >    but will help ensure that things are behaving as you expect them to!
> >
> >    Michael
> >
> >    On Wed, Jun 17, 2020 at 4:31 AM James Bodkin
> >    <james.bod...@loveholidays.com> wrote:
> >>
> >> Thanks, I've implemented some queries that improve the first-hit execution 
> >> for faceting.
> >>
> >> Since turning off indexed on those fields, we've noticed that 
> >> facet.method=enum no longer returns the facets when used.
> >> Using facet.method=fc/fcs is significantly slower compared to 
> >> facet.method=enum for us. Why do these two differences exist?
> >>
> >> On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >>
> >>    Ok, I see the disconnect... Necessary parts of the index are read from 
> >> disk
> >>    lazily. So your newSearcher or firstSearcher query needs to do whatever
> >>    operation causes the relevant parts of the index to be read. In this 
> >> case,
> >>    probably just facet on all the fields you care about. I'd add sorting 
> >> too
> >>    if you sort on different fields.
> >>
> >>    The *:* query without facets or sorting does virtually nothing due to 
> >> some
> >>    special handling...
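> >>
> >>    Concretely, a newSearcher entry along these lines (using the facet
> >>    fields from earlier in the thread) would read those structures at
> >>    commit time:
> >>
> >>        <listener event="newSearcher" class="solr.QuerySenderListener">
> >>          <arr name="queries">
> >>            <lst>
> >>              <str name="q">*:*</str>
> >>              <str name="facet">true</str>
> >>              <str name="facet.field">D_DepartureAirport</str>
> >>              <str name="facet.field">D_Destination</str>
> >>              <str name="rows">0</str>
> >>            </lst>
> >>          </arr>
> >>        </listener>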
> >>
> >>    On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com>
> >>    wrote:
> >>
> >>> I've been trying to build a query that I can use in newSearcher based off
> >>> the information in your previous e-mail. I thought you meant to build a 
> >>> *:*
> >>> query as per Query 1 in my previous e-mail but I'm still seeing the
> >>> first-hit execution.
> >>> Now I'm wondering if you meant to create a *:* query with each of the
> >>> fields as part of the fl query parameters or a *:* query with each of the
> >>> fields and values as part of the fq query parameters.
> >>>
> >>> At the moment I've been running these manually as I expected that I would
> >>> see the first-execution penalty disappear by the time I got to query 4, as
> >>> I thought this would replicate the actions of the newSearcher.
> >>> Unfortunately we can't use the autowarm count that is available as part of
> >>> the filterCache due to the custom deployment mechanism we use
> >>> to update our index.
> >>>
> >>> Kind Regards,
> >>>
> >>> James Bodkin
> >>>
> >>> On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >>>
> >>>    Did you try the autowarming like I mentioned in my previous e-mail?
> >>>
> >>>> On Jun 16, 2020, at 10:18 AM, James Bodkin <
> >>> james.bod...@loveholidays.com> wrote:
> >>>>
> >>>> We've changed the schema to enable docValues for these fields and
> >>> this led to an improvement in the response time. We found a further
> >>> improvement by also switching off indexed as these fields are used for
> >>> faceting and filtering only.
> >>>> Since those changes, we've found that the first-execution for
> >>> queries is really noticeable. I thought this would be the filterCache 
> >>> based
> >>> on what I saw in NewRelic however it is probably trying to read the
> >>> docValues from disk. How can we use the autowarming to improve this?
> >>>>
> >>>> For example, I've run the following queries in sequence and each
> >>> query has a first-execution penalty.
> >>>>
> >>>> Query 1:
> >>>>
> >>>> q=*:*
> >>>> facet=true
> >>>> facet.field=D_DepartureAirport
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 2:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2660)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 3:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2661)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 4:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2660+OR+2661)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> We've kept the field type as a string, as the value is mapped by
> >>> the application that accesses Solr. In the examples above, the values are
> >>> mapped to airports and destinations.
> >>>> Is it possible to prewarm the above queries without having to define
> >>> all the potential filters manually in the auto warming?
> >>>>
> >>>> At the moment, we update and optimise our index in a different
> >>> environment and then copy the index to our production instances by using a
> >>> rolling deployment in Kubernetes.
> >>>>
> >>>> Kind Regards,
> >>>>
> >>>> James Bodkin
> >>>>
> >>>> On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com>
> >>> wrote:
> >>>>
> >>>>   I question whether fiterCache has anything to do with it, I
> >>> suspect what’s really happening is that first time you’re reading the
> >>> relevant bits from disk into memory. And to double check you should have
> >>> docValues enabled for all these fields. The “uninverting” process can be
> >>> very expensive, and docValues bypasses that.
> >>>>
> >>>>   As of Solr 7.6, you can define “uninvertible=true” to your
> >>> field(Type) to “fail fast” if Solr needs to uninvert the field.
> >>>>
> >>>>   But that’s an aside. In either case, my claim is that first-time
> >>> execution does “something”, either reads the serialized docValues from 
> >>> disk
> >>> or uninverts the file on Solr’s heap.
> >>>>
> >>>>   You can have this autowarmed by any combination of
> >>>>   1> specifying an autowarm count on your queryResultCache. That’s
> >>> hit or miss, as it replays the most recent N queries which may or may not
> >>> contain the sorts. That said, specifying 10-20 for autowarm count is
> >>> usually a good idea, assuming you’re not committing more than, say, every
> >>> 30 seconds. I’d add the same to filterCache too.
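> >>>>
> >>>>   In solrconfig.xml that looks roughly like this (sizes here are
> >>>>   illustrative, not recommendations):
> >>>>
> >>>>       <filterCache class="solr.CaffeineCache" size="8192"
> >>>>                    initialSize="512" autowarmCount="20"/>
> >>>>       <queryResultCache class="solr.CaffeineCache" size="512"
> >>>>                        initialSize="512" autowarmCount="20"/>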
> >>>>
> >>>>   2> specifying a newSearcher or firstSearcher query in
> >>> solrconfig.xml. The difference is that newSearcher is fired every time a
> >>> commit happens, while firstSearcher is only fired when Solr starts, the
> >>> theory being that there’s no cache autowarming available when Solr first
> >>> powers up. Usually, people don’t bother with firstSearcher or just make it
> >>> the same as newSearcher. Note that a query doesn’t have to be “real” at
> >>> all. You can just add all the facet fields to a *:* query in a single go.
> >>>>
> >>>>   BTW, Trie fields will stay around for a long time even though
> >>> deprecated. Or at least until we find something to replace them with that
> >>> doesn’t have this penalty, so I’d feel pretty safe using those and they’ll
> >>> be more efficient than strings.
> >>>>
> >>>>   Best,
> >>>>   Erick
> >>>>
> >>>
> >>>
>
