queryResultCache doesn’t really help with faceting, even if it’s hit for the main query. That cache only stores a subset of the hits, and to facet properly you need the entire result set….
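For reference, the bit of solrconfig.xml in play here looks roughly like the following (cache class and sizes are purely illustrative, not a recommendation). The queryResultCache stores an ordered window of document ids per query/sort, up to queryResultWindowSize at a time, rather than the full set of matching documents, which is why it can't be reused to drive faceting:

    <queryResultCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="16"/>
    <queryResultWindowSize>50</queryResultWindowSize>
    <queryResultMaxDocsCached>200</queryResultMaxDocsCached>
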
> On Jun 17, 2020, at 12:47 PM, James Bodkin <james.bod...@loveholidays.com> wrote:
>
> We've noticed that the filterCache uses a significant amount of memory, as we've assigned 8GB Heap per instance.
> In total, we have 32 shards with 2 replicas, hence (8 * 32 * 2) 512G of heap space alone; further memory is required to ensure the index is always memory mapped for performance reasons.
>
> Ideally I would like to be able to reduce the amount of memory assigned to the heap by using docValues instead of indexed, but it doesn't seem possible.
> The QTime (after warming) for facet.method=enum is around 150-250ms, whereas the QTime for facet.method=fc is around 1000-1200ms.
> As we require the results in real-time for customers searching on our website, the latter QTime of 1000-1200ms is too slow for us to be able to use.
>
> Our facet queries change as the customer selects different search criteria, and hence the sheer number of potential queries makes it very difficult for the query result cache to help.
> We already have a custom implementation in which we check our redis cache for queries before they are sent to our aggregators, which runs at a 30% hit rate.
>
> Kind Regards,
>
> James Bodkin
>
> On 17/06/2020, 16:21, "Michael Gibney" <mich...@michaelgibney.net> wrote:
>
> To expand a bit on what Erick said regarding performance: my sense is that the RefGuide assertion that "docValues=true" makes faceting "faster" could use some qualification/clarification. My take, fwiw:
>
> First, to reiterate/paraphrase what Erick said: the "faster" assertion is not comparing to "facet.method=enum". For low-cardinality fields, if you have the heap space, and are very intentional about configuring your filterCache (and monitoring it as access patterns might change), "facet.method=enum" will likely be as fast as you can get (at least for "legacy" facets or whatever -- not sure about the "enum" method in JSON facets).
>
> Even where "docValues=true" arguably does make faceting "faster", the main benefit is that the "uninverted" data structures are serialized on disk, so you're avoiding the need to uninvert each facet field on-heap for every new indexSearcher, which is generally high-latency -- user perception of this latency can be mitigated using warming queries, but it can still be problematic, esp. for frequent index updates. On-heap uninversion also inherently consumes a lot of heap space, which has general implications wrt GC, etc ... so in that respect, even if faceting per se might not be "faster" with "docValues=true", your overall system may in many cases perform better.
>
> (And Anthony, I'm pretty sure that tag/ex on facets should be orthogonal to the "facet.method=enum"/filterCache discussion, as tag/ex only affects the DocSet domain over which facets are calculated ... I think that step is pretty cleanly separated from the actual calculation of the facets. I'm not 100% sure on that, so proceed with caution, but it could definitely be worth evaluating for your use case!)
>
> Michael
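Concretely, the two set-ups being weighed in this thread look roughly like this in the schema (the field name is borrowed from the queries further down; stored/multiValued and the rest depend on your actual schema):

    <!-- docValues route: facet data is serialized on disk, no on-heap uninversion needed -->
    <field name="D_Destination" type="string" indexed="true" stored="false" docValues="true" uninvertible="false"/>

    <!-- enum/filterCache route: facet by running one filter query per indexed term -->
    <field name="D_Destination" type="string" indexed="true" stored="false" docValues="false"/>
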
> On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson <erickerick...@gmail.com> wrote:
>>
>> Uninvertible is a safety mechanism to make sure that you don't _unknowingly_ use a docValues=false field for faceting/grouping/sorting/function queries. The primary point of docValues=true is twofold:
>>
>> 1> reduce Java heap requirements by letting the OS memory hold the data
>>
>> 2> uninverting can be expensive CPU wise too, although not with just a few unique values (for each term, read the list of docs that have it and flip a bit).
>>
>> It doesn't really make sense to set it on an index=false field, since uninverting only happens on index=true docValues=false. OTOH, I don't think it would do any harm either. That said, I frankly don't know how that interacts with facet.method=enum.
>>
>> As far as speed… yeah, you're in the edge cases. All things being equal, stuffing these into the filterCache is the fastest way to facet if you have the memory. I've seen very few installations where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 + some overhead bytes. If maxDoc is very large, this'll chew up an enormous amount of memory. I'm cheating a bit here, since the size might be smaller if only a few docs match a particular entry. But that's the worst case you have to allow for, 'cause you could theoretically hit the perfect storm where, due to some particular sequence of queries, your entire filter cache fills up with entries that size.
>>
>> You'll have some overhead to keep the cache at that size, but it sounds like it's worth it.
>>
>> Best,
>> Erick
>>
>>> On Jun 17, 2020, at 10:05 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
>>>
>>> The large majority of the relevant fields have fewer than 20 unique values. We have two fields over that, with 150 unique values and 5300 unique values respectively.
>>> At the moment, our filterCache is configured with a maximum size of 8192.
>>>
>>> The DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that this approach promises to make lookups for faceting, sorting and grouping much faster.
>>> Hence I thought that using DocValues would be better than using Indexed, and in turn improve our response times and possibly lower memory requirements. It sounds like this isn't the case if you are able to allocate enough memory to the filterCache.
>>>
>>> I haven't yet tried changing the uninvertible setting; I was looking at the documentation for this field earlier today.
>>> Should we be setting uninvertible="false" if docValues="true", regardless of whether indexed is true or false?
>>>
>>> Kind Regards,
>>>
>>> James Bodkin
>>>
>>> On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:
>>>
>>> facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works.
>>> I was going to suggest that you might not want to set indexed=false on the docValues facet fields anyway, since the indexed values are still used for facet refinement (assuming your index is distributed).
>>>
>>> What's the number of unique values in the relevant fields? If it's low enough, setting docValues=false and indexed=true and using facet.method=enum (with a sufficiently large filterCache) is definitely a viable option, and will almost certainly be faster than docValues-based faceting.
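To put rough numbers on the maxDoc/8 estimate above (the maxDoc figure here is purely illustrative, substitute your own):

    maxDoc = 20,000,000  ->  20,000,000 / 8 = 2,500,000 bytes, i.e. roughly 2.5MB per filterCache entry, worst case
    8192 entries (the maximum size mentioned above) x ~2.5MB = roughly 20GB of heap, worst case

so the filterCache maximum size and the per-instance heap do have to be budgeted together if you go the facet.method=enum route.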
>>> (As an aside, noting for future reference: high-cardinality facets over high-cardinality DocSet domains might be able to benefit from a term facet count cache: https://issues.apache.org/jira/browse/SOLR-13807)
>>>
>>> I think you didn't specifically mention whether you acted on Erick's suggestion of setting "uninvertible=false" (I think Erick accidentally said "uninvertible=true") to fail fast. I'd also recommend doing that, perhaps even above all else -- it shouldn't actually *do* anything, but will help ensure that things are behaving as you expect them to!
>>>
>>> Michael
>>>
>>> On Wed, Jun 17, 2020 at 4:31 AM James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>
>>>> Thanks, I've implemented some queries that improve the first-hit execution for faceting.
>>>>
>>>> Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used.
>>>> Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these differences exist?
>>>>
>>>> On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>
>>>> Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields.
>>>>
>>>> The *:* query without facets or sorting does virtually nothing due to some special handling...
>>>>
>>>> On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>>
>>>>> I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail, but I'm still seeing the first-hit execution.
>>>>> Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters, or a *:* query with each of the fields and values as part of the fq query parameters.
>>>>>
>>>>> At the moment I've been running these manually, as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher.
>>>>> Unfortunately we can't use the autowarm count that is available as part of the queryResultCache/filterCache due to the custom deployment mechanism we use to update our index.
>>>>>
>>>>> Kind Regards,
>>>>>
>>>>> James Bodkin
>>>>>
>>>>> On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>>
>>>>> Did you try the autowarming like I mentioned in my previous e-mail?
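(The autowarming being referred to here is the autowarmCount attribute on the caches in solrconfig.xml, along these lines; the count is just a placeholder, and queryResultCache takes the same attribute:

    <filterCache class="solr.LRUCache" size="8192" initialSize="512" autowarmCount="16"/>

When a new searcher is opened, Solr replays that many of the most recently used cache keys against it.)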
>>>>>> On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
>>>>>>
>>>>>> We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed, as these fields are used for faceting and filtering only.
>>>>>>
>>>>>> Since those changes, we've found that the first-execution for queries is really noticeable. I thought this would be the filterCache, based on what I saw in NewRelic, however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?
>>>>>>
>>>>>> For example, I've run the following queries in sequence and each query has a first-execution penalty.
>>>>>>
>>>>>> Query 1:
>>>>>>
>>>>>> q=*:*
>>>>>> facet=true
>>>>>> facet.field=D_DepartureAirport
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 2:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2660)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 3:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2661)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> Query 4:
>>>>>>
>>>>>> q=*:*
>>>>>> fq=D_DepartureAirport:(2660+OR+2661)
>>>>>> facet=true
>>>>>> facet.field=D_Destination
>>>>>> facet.limit=-1
>>>>>> rows=0
>>>>>>
>>>>>> We've kept the field type as a string, as the value is mapped by the application that accesses Solr. In the examples above, the values are mapped to airports and destinations.
>>>>>> Is it possible to prewarm the above queries without having to define all the potential filters manually in the autowarming?
>>>>>>
>>>>>> At the moment, we update and optimise our index in a different environment and then copy the index to our production instances by using a rolling deployment in Kubernetes.
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> James Bodkin
>>>>>>
>>>>>> On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com> wrote:
>>>>>>
>>>>>> I question whether the filterCache has anything to do with it; I suspect what's really happening is that the first time, you're reading the relevant bits from disk into memory. And to double-check, you should have docValues enabled for all these fields. The "uninverting" process can be very expensive, and docValues bypasses that.
>>>>>>
>>>>>> As of Solr 7.6, you can define "uninvertible=true" on your field(Type) to "fail fast" if Solr needs to uninvert the field.
>>>>>>
>>>>>> But that's an aside. In either case, my claim is that first-time execution does "something": either reads the serialized docValues from disk or uninverts the field on Solr's heap.
>>>>>>
>>>>>> You can have this autowarmed by any combination of:
>>>>>>
>>>>>> 1> specifying an autowarm count on your queryResultCache. That's hit or miss, as it replays the most recent N queries, which may or may not contain the sorts. That said, specifying 10-20 for the autowarm count is usually a good idea, assuming you're not committing more often than, say, every 30 seconds. I'd add the same to the filterCache too.
>>>>>>
>>>>>> 2> specifying a newSearcher or firstSearcher query in solrconfig.xml. The difference is that newSearcher is fired every time a commit happens, while firstSearcher is only fired when Solr starts, the theory being that there's no cache autowarming available when Solr first powers up. Usually, people don't bother with firstSearcher or just make it the same as newSearcher. Note that a query doesn't have to be "real" at all. You can just add all the facet fields to a *:* query in a single go.
>>>>>>
>>>>>> BTW, Trie fields will stay around for a long time even though deprecated. Or at least until we find something to replace them with that doesn't have this penalty, so I'd feel pretty safe using those, and they'll be more efficient than strings.
>>>>>>
>>>>>> Best,
>>>>>> Erick
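A minimal sketch of option 2> above, using the two facet fields from this thread (D_DepartureAirport and D_Destination); the listener syntax is stock solrconfig.xml, while the query list itself is just an illustration to trim or extend as needed:

    <listener event="newSearcher" class="solr.QuerySenderListener">
      <arr name="queries">
        <lst>
          <!-- a single *:* query that facets on all the fields we care about, so the
               docValues (or uninverted) structures are read before user queries arrive -->
          <str name="q">*:*</str>
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.field">D_DepartureAirport</str>
          <str name="facet.field">D_Destination</str>
          <str name="facet.limit">-1</str>
        </lst>
      </arr>
    </listener>
    <!-- an identical block with event="firstSearcher" covers the cold-start case -->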