To expand a bit on what Erick said regarding performance: my sense is that the RefGuide assertion that "docValues=true" makes faceting "faster" could use some qualification/clarification. My take, fwiw:
First, to reiterate/paraphrase what Erick said: the "faster" assertion is not comparing to "facet.method=enum". For low-cardinality fields, if you have the heap space, and are very intentional about configuring your filterCache (and monitoring it as access patterns might change), "facet.method=enum" will likely be as fast as you can get (at least for "legacy" facets or whatever -- not sure about "enum" method in JSON facets).

Even where "docValues=true" arguably does make faceting "faster", the main benefit is that the "uninverted" data structures are serialized on disk, so you're avoiding the need to uninvert each facet field on-heap for every new indexSearcher, which is generally high-latency -- user perception of this latency can be mitigated using warming queries, but it can still be problematic, esp. for frequent index updates. On-heap uninversion also inherently consumes a lot of heap space, which has general implications wrt GC, etc ... so in that respect even if faceting per se might not be "faster" with "docValues=true", your overall system may in many cases perform better.

(and Anthony, I'm pretty sure that tag/ex on facets should be orthogonal to the "facet.method=enum"/filterCache discussion, as tag/ex only affects the DocSet domain over which facets are calculated ... I think that step is pretty cleanly separated from the actual calculation of the facets. I'm not 100% sure on that, so proceed with caution, but it could definitely be worth evaluating for your use case!)

Michael

On Wed, Jun 17, 2020 at 10:42 AM Erick Erickson <erickerick...@gmail.com> wrote:
>
> Uninvertible is a safety mechanism to make sure that you don’t _unknowingly_ use a docValues=false field for faceting/grouping/sorting/function queries.
> The primary point of docValues=true is twofold:
>
> 1> reduce Java heap requirements by using the OS memory to hold it
>
> 2> uninverting can be expensive CPU wise too, although not with just a few unique values (for each term, read the list of docs that have it and flip a bit).
>
> It doesn’t really make sense to set it on an indexed=false field, since uninverting only happens on indexed=true docValues=false. OTOH, I don’t think it would do any harm either. That said, I frankly don’t know how that interacts with facet.method=enum.
>
> As far as speed… yeah, you’re in the edge cases. All things being equal, stuffing these into the filterCache is the fastest way to facet if you have the memory. I’ve seen very few installations where people have that luxury though. Each entry in the filterCache can occupy maxDoc/8 bytes plus some overhead. If maxDoc is very large, this’ll chew up an enormous amount of memory. I’m cheating a bit here, since an entry can be smaller if only a few docs match it. But that’s the worst-case you have to allow for ‘cause you could theoretically hit the perfect storm where, due to some particular sequence of queries, your entire filter cache fills up with entries that size.
>
> You’ll have some overhead to keep the cache at that size, but it sounds like it’s worth it.
>
> Best,
> Erick
>
> > On Jun 17, 2020, at 10:05 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
> >
> > The large majority of the relevant fields have fewer than 20 unique values. We have two fields over that with 150 unique values and 5300 unique values respectively. At the moment, our filterCache is configured with a maximum size of 8192.
> >
> > The DocValues documentation (https://lucene.apache.org/solr/guide/8_3/docvalues.html) mentions that this approach promises to make lookups for faceting, sorting and grouping much faster.
> > Hence I thought that using DocValues would be better than using Indexed and in turn improve our response times and possibly lower memory requirements. It sounds like this isn't the case if you are able to allocate enough memory to the filterCache.
> >
> > I haven't yet tried changing the uninvertible setting; I was looking at the documentation for this field earlier today. Should we be setting uninvertible="false" if docValues="true" regardless of whether indexed is true or false?
> >
> > Kind Regards,
> >
> > James Bodkin
> >
> > On 17/06/2020, 14:02, "Michael Gibney" <mich...@michaelgibney.net> wrote:
> >
> > facet.method=enum works by executing a query (against indexed values) for each indexed value in a given field (which, for indexed=false, is "no values"). So that explains why facet.method=enum no longer works. I was going to suggest that you might not want to set indexed=false on the docValues facet fields anyway, since the indexed values are still used for facet refinement (assuming your index is distributed).
> >
> > What's the number of unique values in the relevant fields? If it's low enough, setting docValues=false and indexed=true and using facet.method=enum (with a sufficiently large filterCache) is definitely a viable option, and will almost certainly be faster than docValues-based faceting. (As an aside, noting for future reference: high-cardinality facets over high-cardinality DocSet domains might be able to benefit from a term facet count cache: https://issues.apache.org/jira/browse/SOLR-13807)
> >
> > I think you didn't specifically mention whether you acted on Erick's suggestion of setting "uninvertible=false" (I think Erick accidentally said "uninvertible=true") to fail fast. I'd also recommend doing that, perhaps even above all else -- it shouldn't actually *do* anything, but will help ensure that things are behaving as you expect them to!
> >
> > Michael
> >
> > On Wed, Jun 17, 2020 at 4:31 AM James Bodkin <james.bod...@loveholidays.com> wrote:
> >>
> >> Thanks, I've implemented some queries that improve the first-hit execution for faceting.
> >>
> >> Since turning off indexed on those fields, we've noticed that facet.method=enum no longer returns the facets when used. Using facet.method=fc/fcs is significantly slower compared to facet.method=enum for us. Why do these two differences exist?
> >>
> >> On 16/06/2020, 17:52, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >>
> >> Ok, I see the disconnect... Necessary parts of the index are read from disk lazily. So your newSearcher or firstSearcher query needs to do whatever operation causes the relevant parts of the index to be read. In this case, probably just facet on all the fields you care about. I'd add sorting too if you sort on different fields.
> >>
> >> The *:* query without facets or sorting does virtually nothing due to some special handling...
> >>
> >> On Tue, Jun 16, 2020, 10:48 James Bodkin <james.bod...@loveholidays.com> wrote:
> >>
> >>> I've been trying to build a query that I can use in newSearcher based off the information in your previous e-mail. I thought you meant to build a *:* query as per Query 1 in my previous e-mail but I'm still seeing the first-hit execution. Now I'm wondering if you meant to create a *:* query with each of the fields as part of the fl query parameters or a *:* query with each of the fields and values as part of the fq query parameters.
> >>>
> >>> At the moment I've been running these manually as I expected that I would see the first-execution penalty disappear by the time I got to query 4, as I thought this would replicate the actions of the newSearcher.
> >>> Unfortunately we can't use the autowarm count that is available as part of the filterCache/filterCache due to the custom deployment mechanism we use to update our index.
> >>>
> >>> Kind Regards,
> >>>
> >>> James Bodkin
> >>>
> >>> On 16/06/2020, 15:30, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >>>
> >>> Did you try the autowarming like I mentioned in my previous e-mail?
> >>>
> >>>> On Jun 16, 2020, at 10:18 AM, James Bodkin <james.bod...@loveholidays.com> wrote:
> >>>>
> >>>> We've changed the schema to enable docValues for these fields and this led to an improvement in the response time. We found a further improvement by also switching off indexed as these fields are used for faceting and filtering only.
> >>>> Since those changes, we've found that the first-execution for queries is really noticeable. I thought this would be the filterCache based on what I saw in NewRelic however it is probably trying to read the docValues from disk. How can we use the autowarming to improve this?
> >>>>
> >>>> For example, I've run the following queries in sequence and each query has a first-execution penalty.
> >>>>
> >>>> Query 1:
> >>>>
> >>>> q=*:*
> >>>> facet=true
> >>>> facet.field=D_DepartureAirport
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 2:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2660)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 3:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2661)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> Query 4:
> >>>>
> >>>> q=*:*
> >>>> fq=D_DepartureAirport:(2660+OR+2661)
> >>>> facet=true
> >>>> facet.field=D_Destination
> >>>> facet.limit=-1
> >>>> rows=0
> >>>>
> >>>> We've kept the field type as a string, as the value is mapped by the application that accesses Solr.
> >>>> In the examples above, the values are mapped to airports and destinations.
> >>>> Is it possible to prewarm the above queries without having to define all the potential filters manually in the auto warming?
> >>>>
> >>>> At the moment, we update and optimise our index in a different environment and then copy the index to our production instances by using a rolling deployment in Kubernetes.
> >>>>
> >>>> Kind Regards,
> >>>>
> >>>> James Bodkin
> >>>>
> >>>> On 12/06/2020, 18:58, "Erick Erickson" <erickerick...@gmail.com> wrote:
> >>>>
> >>>> I question whether filterCache has anything to do with it, I suspect what’s really happening is that the first time, you’re reading the relevant bits from disk into memory. And to double check you should have docValues enabled for all these fields. The “uninverting” process can be very expensive, and docValues bypasses that.
> >>>>
> >>>> As of Solr 7.6, you can define “uninvertible=true” to your field(Type) to “fail fast” if Solr needs to uninvert the field.
> >>>>
> >>>> But that’s an aside. In either case, my claim is that first-time execution does “something”, either reads the serialized docValues from disk or uninverts the file on Solr’s heap.
> >>>>
> >>>> You can have this autowarmed by any combination of
> >>>> 1> specifying an autowarm count on your queryResultCache. That’s hit or miss, as it replays the most recent N queries which may or may not contain the sorts. That said, specifying 10-20 for autowarm count is usually a good idea, assuming you’re not committing more than, say, every 30 seconds. I’d add the same to filterCache too.
> >>>>
> >>>> 2> specifying a newSearcher or firstSearcher query in solrconfig.xml.
> >>>> The difference is that newSearcher is fired every time a commit happens, while firstSearcher is only fired when Solr starts, the theory being that there’s no cache autowarming available when Solr first powers up. Usually, people don’t bother with firstSearcher or just make it the same as newSearcher. Note that a query doesn’t have to be “real” at all. You can just add all the facet fields to a *:* query in a single go.
> >>>>
> >>>> BTW, Trie fields will stay around for a long time even though deprecated. Or at least until we find something to replace them with that doesn’t have this penalty, so I’d feel pretty safe using those and they’ll be more efficient than strings.
> >>>>
> >>>> Best,
> >>>> Erick
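
For anyone sizing this: Erick's worst-case filterCache estimate quoted in this thread (each entry can occupy roughly maxDoc/8 bytes, plus some overhead) is easy to sanity-check with a few lines of Python. This is only a rough sketch under stated assumptions: the function name is made up for illustration, the per-entry overhead is ignored, and the maxDoc value is a hypothetical example, not a number from James's index. Only the cache size of 8192 comes from the thread.

```python
def worst_case_filter_cache_bytes(max_doc: int, cache_size: int) -> int:
    """Upper bound on filterCache heap use, assuming every cached entry
    is a full bitset with one bit per document (max_doc / 8 bytes).
    Per-entry overhead is deliberately left out of this estimate."""
    bytes_per_entry = (max_doc + 7) // 8  # round up to whole bytes
    return bytes_per_entry * cache_size


# cache_size is the filterCache size mentioned in the thread;
# max_doc is a hypothetical value chosen only for illustration.
max_doc = 10_000_000
cache_size = 8192

total = worst_case_filter_cache_bytes(max_doc, cache_size)
print(f"worst case: {total / 1024**3:.2f} GiB")  # worst case: 9.54 GiB
```

Real usage may well be lower, since (as Erick notes) entries matching only a few docs can be stored more compactly; the point is just that the worst case scales with maxDoc times the cache size.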