[jira] [Commented] (SOLR-13807) Caching for term facet counts

Michael Gibney (Jira) Tue, 10 Mar 2020 09:17:11 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17056100#comment-17056100
 ]


Michael Gibney commented on SOLR-13807:
---------------------------------------

Thanks for responding on these points, [~hossman]! Apologies for my delay in 
responding, but it's taken me a while to dig into actually addressing some of 
the issues uncovered by testing (just pushed to [PR 
#751|https://github.com/apache/lucene-solr/pull/751]). Before embarking on a 
potential major refactor of PR code that is, I believe, essentially sound, I 
first wanted to address the test failures in the existing PR and then see where 
we are with things.

The changes required were not large in terms of number of lines of code. Aside 
from some trivial bug fixes, the substantive issues addressed fell into three 
categories, broadly speaking:
 # UIF caching was an afterthought in the initial patch. I knew this at the 
time I opened the PR (and should have called it out more explicitly) but 
although I had roughed in some of the cache-entry-building logic as a POC, 
nothing was ever actually getting inserted into the cache (!) and not all code 
branches were covered. It was fairly straightforward to bring UIF into line 
(and I re-enabled the UIF cases from your initial test).
 # Cache entry compatibility across different methods of facet processing. I 
had to clarify that term counts are only eligible for caching when 
{{prefix==null}} (or {{prefix.isEmpty()}}). (It would be possible to use 
no-prefix cached term counts to process prefixed facet requests, but I think it 
makes sense to leave that for later, if at all.) Aside from that, missing 
buckets are collected _inline_ and cached for {{FacetFieldProcessorByArrayDV}}, 
but are _not_ collected (nor cached) for {{FacetFieldProcessorByArrayUIF}} (or 
legacy {{DocValuesFacets}}) processing. In practice, it's unlikely that the 
same field would be processed both as UIF (no cached "missing" count) _and_ as 
DV (cached "missing" count), but the case did come up in testing, and I 
addressed it by detecting and re-processing with {{*ByArrayDV}}, and replacing 
the cache entry with the new one that includes "missing" count. The resulting 
"missing"-inclusive cache-entry is backward-compatible with (may be used by) 
{{*ByArrayUIF}} and legacy {{DocValuesFacets}} processing implementations. 
Incidentally, I wonder whether this "inline" collection of "missing" counts is 
something like what you had in mind with the comment "{{TODO: it would be more 
efficient to build up a missing DocSet if we need it here anyway.}}"?
 # Cache key compatibility across blockJoin domain changes. The extant "nested 
facet" implementation only passes the {{base}} DocSet domain down from parent 
to child. One of the things this PR had to do was to also track corresponding 
changes to the {{baseFilters}} – the queries used to generate the {{base}} 
DocSet domain – because these queries are required for use in facet cache keys. 
The initial PR punted on the question of blockJoin domain changes, and simply 
set {{baseFilters = null}}, with a comment in code: "{{unusual case; TODO: can 
we make a cache key for this base domain?}}". Well, I meant "unusual _for me, at 
the moment_" :); I just had to put the effort into building proper 
({{baseFilter}} query) cache keys for these domain changes. In the process, I 
also realized that tracking {{baseFilters}} down the nested facet tree should 
probably address "{{TODO: somehow remove responsebuilder dependency}}" – I put 
a {{nocommit}} comment to that effect (and temporarily throw an 
{{AssertionError}} to highlight the code following it, which I think may now 
be dead). I also found myself wondering how exclusion of ancestor tagged 
filters would affect descendant join/graph/blockjoin domain changes ... but 
that's a separate issue.
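
For illustration only, points 2 and 3 above can be sketched roughly as follows. This is a minimal stand-alone sketch, not the code in the PR: {{TermCountCacheSketch}}, {{Entry}}, {{cacheable}}, {{key}} and {{get}} are hypothetical names, base filters are modeled as plain strings rather than queries, and a simple map stands in for a real Solr cache. It shows three ideas: only un-prefixed requests are cacheable, the cache key is derived from the field plus the filters that define the base domain (with unknown filters meaning "do not cache"), and an entry without a "missing" count is recomputed and replaced when a request needs one.

```java
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.Supplier;

/** Hypothetical sketch; none of these names are taken from the PR. */
final class TermCountCacheSketch {

    /** Per-ordinal term counts plus an optional "missing" bucket count. */
    static final class Entry {
        final int[] counts;
        final Integer missing; // null when the producing processor did not collect "missing"

        Entry(int[] counts, Integer missing) {
            this.counts = counts;
            this.missing = missing;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();

    /** Term counts are only eligible for caching when no prefix is requested. */
    static boolean cacheable(String prefix) {
        return prefix == null || prefix.isEmpty();
    }

    /**
     * Builds a key from the field and the filters that define the base domain.
     * Filters are plain strings here (an assumption made for this sketch).
     * Returns null when the filters are unknown, meaning "do not cache".
     */
    static String key(String field, Collection<String> baseFilters) {
        if (baseFilters == null) {
            return null;
        }
        Set<String> sorted = new TreeSet<>(baseFilters); // order-insensitive, de-duplicated
        return field + "|" + sorted;
    }

    /**
     * Returns a cached entry if it satisfies the request; otherwise runs the supplied
     * computation and stores the result, replacing any entry that lacks "missing".
     */
    Entry get(String key, boolean needMissing, Supplier<Entry> compute) {
        if (key == null) {
            return compute.get(); // uncacheable domain: always recompute
        }
        Entry cached = cache.get(key);
        if (cached != null && (!needMissing || cached.missing != null)) {
            return cached;
        }
        Entry fresh = compute.get();
        cache.put(key, fresh);
        return fresh;
    }
}
```

In this sketch an entry that carries a "missing" count also satisfies later requests that do not need one, which mirrors the backward compatibility described in point 2.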

> Caching for term facet counts
> -----------------------------
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807__SOLR-13132_test_stub.patch
>
>
> Solr does not have a facet count cache, so for _every_ request, term facets 
> are recalculated for _every_ (facet) field, by iterating over _every_ field 
> value for _every_ doc in the result domain, and incrementing the associated 
> count.
> As a result, subsequent requests end up redoing a lot of the same work, 
> including all associated object allocation, GC, etc. This situation could 
> benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet 
> calculation, latency is proportional to the size of the result domain. 
> Consequently, one common/clear manifestation of this issue is high latency 
> for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be 
> observed on a top-level landing page that exposes facets. This type of 
> "static" case is often mitigated by external (to Solr) caching, either with a 
> caching layer between Solr and a front-end application, or within a front-end 
> application, or even with a caching layer between the end user and a 
> front-end application.
> But in addition to the overhead of handling this caching elsewhere in the 
> stack (or, for a new user, even being aware of this as a potential issue to 
> mitigate), any external caching mitigation is really only appropriate for 
> relatively static cases like the "landing page" example described above. A 
> Solr-internal facet count cache (analogous to the {{filterCache}}) would 
> provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance 
> concern
>  # compact (specifically caching count arrays, without the extra baggage that 
> accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with 
> variant requests over the same result domain (this would support common use 
> cases like paging, but also potentially more interesting direct uses of 
> facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a 
> given domain are cached, a refinement request could simply look up the 
> ordinal value for each enumerated term and directly grab the count out of the 
> count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on 
> facet counts across different domains, like SKG/relatedness – see SOLR-13132)



