[
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175836#comment-17175836
]
Michael Gibney commented on SOLR-13807:
---------------------------------------
After SOLR-13132 was merged to master, it was a bit of a challenge to reconcile
it with the complementary "term facet cache" (this issue). I've taken an initial
stab at this and pushed to [PR
#1357|https://github.com/apache/lucene-solr/pull/1357], and I think it's at the
point where it's once again ready for consideration.
Below are some naive performance benchmarks, using [^SOLR-13807-benchmarks.tgz]
(based on similar benchmarks for SOLR-13132).
{{filterCache}} is irrelevant for what's illustrated here (all count or sweep
collection, single-shard thus no refinement). The attached scripts include
hooks to easily change the filterCache size and termFacetCache size for
evaluation. For purposes of {{relatedness}} evaluation, fgSet == base search
result domain. All results discussed here are for single-valued string fields,
but multivalued string fields are also included in the benchmark attachment
(results for multi-valued didn't differ substantially from those for
single-valued).
There's a row for each docset domain recall percentage (percentage of \*:*
domain returned by main query/fg), and a column for each field cardinality;
cell values indicate latency (QTime) in ms against a single core with 3 million
docs, no deletes; each value is the average of 10 repeated invocations of the
relevant request (standard deviation isn't captured here, but was quite
low, fwiw).
Below are for current (including SOLR-13132) master; no caches (filterCache, if
present, would be unused):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, master
cdnlty: 10 100 1k 10k 100k 1m
.1% 0 0 0 0 0 4
1% 1 0 1 1 2 5
10% 7 7 8 8 10 16
20% 17 14 16 15 19 31
30% 22 19 23 20 24 42
40% 27 26 28 28 32 50
50% 33 32 35 32 38 59
99.99% 65 60 67 62 72 107
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, master
cdnlty: 10 100 1k 10k 100k 1m
.1% 179 174 183 190 192 225
1% 182 177 186 183 194 236
10% 193 191 196 197 226 256
20% 206 200 207 207 234 300
30% 216 210 217 216 239 316
40% 228 225 231 231 253 331
50% 239 234 241 240 266 347
99.99% 285 280 287 287 311 403
{code}
Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with _no_
termFacetCache configured (apples-to-apples, since there are changes in some of
the hot facet code paths):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, no_cache
cdnlty: 10 100 1k 10k 100k 1m
.1% 0 0 0 0 0 3
1% 1 1 1 1 1 6
10% 8 8 9 8 11 14
20% 16 15 16 15 20 32
30% 21 21 23 22 26 42
40% 28 27 31 28 34 53
50% 35 33 37 34 40 63
99.99% 68 64 71 66 74 108
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, no_cache
cdnlty: 10 100 1k 10k 100k 1m
.1% 96 80 89 97 96 129
1% 88 83 90 88 101 133
10% 99 97 103 102 122 162
20% 117 107 113 113 135 194
30% 120 117 123 122 144 211
40% 130 129 134 134 156 232
50% 143 140 147 144 169 249
99.99% 179 175 181 179 201 305
{code}
Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with
{{solr.termFacetCacheSize=20}} configured.
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, cache size 20
cdnlty: 10 100 1k 10k 100k 1m
.1% 0 0 0 0 0 2
1% 0 0 0 0 1 10
10% 3 4 4 4 5 16
20% 8 7 8 7 9 20
30% 11 10 12 11 13 25
40% 13 13 15 15 15 28
50% 15 16 16 18 20 32
99.99% 29 30 30 29 32 45
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, cache size 20
cdnlty: 10 100 1k 10k 100k 1m
.1% 0 0 0 0 1 6
1% 0 0 0 1 4 14
10% 3 4 4 5 11 33
20% 9 8 8 8 16 41
30% 10 10 11 12 17 51
40% 13 13 13 14 20 61
50% 16 15 17 17 23 69
99.99% 30 28 30 30 37 101
{code}
The performance boost for sort-by-count has all the normal caveats of any type
of caching, but could result in huge practical performance benefits for "main
index page" and/or paging requests that use facets.
The performance boost for sort-by-skg, on the other hand, in many cases even
transcends normal caching caveats (assuming sweep collection and a relatively
static "background set"). With sweep collection, the common-case background set
of \*:*, e.g., would be cached and used repeatedly even with a minimal
termFacetCache (say, size=10), making for an uncharacteristically consistent
cache boost (a good thing!).
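To illustrate why even a tiny cache helps here, below is a minimal Python sketch (not the Solr implementation). It assumes a plain LRU cache keyed by (field, domain) that stores count arrays; the names {{CountCache}} and {{get_or_compute}} are hypothetical and exist only for this illustration. Each simulated request looks up the shared \*:* background counts plus a request-specific foreground domain.

```python
from collections import OrderedDict


class CountCache:
    """Hypothetical LRU cache: (field, domain_key) -> list of per-term counts."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._entries = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, field, domain_key, compute):
        key = (field, domain_key)
        if key in self._entries:
            # Mark the entry as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        counts = compute()
        self._entries[key] = counts
        if len(self._entries) > self.max_size:
            # Evict the least recently used entry.
            self._entries.popitem(last=False)
        return counts


# Simulate 5 requests; each needs background (*:*) counts and counts
# for its own, never-repeated foreground domain.
cache = CountCache(max_size=2)
for i in range(5):
    cache.get_or_compute("field_a", "*:*", lambda: [10, 20, 30])
    cache.get_or_compute("field_a", "q%d" % i, lambda: [1, 2, 3])
```

In this toy run the background entry misses only on the first request and hits on every later one, even though the foreground domains never repeat and the cache holds just two entries.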
Note that performance of "sort-by-skg" with termFacetCache is comparable to the
performance of simple sort-by-count pre-termFacetCache, and consistent across
field and domain cardinalities.
> Caching for term facet counts
> -----------------------------
>
> Key: SOLR-13807
> URL: https://issues.apache.org/jira/browse/SOLR-13807
> Project: Solr
> Issue Type: New Feature
> Components: Facet Module
> Affects Versions: master (9.0), 8.2
> Reporter: Michael Gibney
> Priority: Minor
> Attachments: SOLR-13807-benchmarks.tgz,
> SOLR-13807__SOLR-13132_test_stub.patch
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets
> are recalculated for _every_ (facet) field, by iterating over _every_ field
> value for _every_ doc in the result domain, and incrementing the associated
> count.
> As a result, subsequent requests end up redoing a lot of the same work,
> including all associated object allocation, GC, etc. This situation could
> benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet
> calculation, latency is proportional to the size of the result domain.
> Consequently, one common/clear manifestation of this issue is high latency
> for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be
> observed on a top-level landing page that exposes facets. This type of
> "static" case is often mitigated by external (to Solr) caching, either with a
> caching layer between Solr and a front-end application, or within a front-end
> application, or even with a caching layer between the end user and a
> front-end application.
> But in addition to the overhead of handling this caching elsewhere in the
> stack (or, for a new user, even being aware of this as a potential issue to
> mitigate), any external caching mitigation is really only appropriate for
> relatively static cases like the "landing page" example described above. A
> Solr-internal facet count cache (analogous to the {{filterCache}}) would
> provide the following additional benefits:
> # ease of use/out-of-the-box configuration to address a common performance
> concern
> # compact (specifically caching count arrays, without the extra baggage that
> accompanies a naive external caching approach)
> # NRT-friendly (could be implemented to be segment-aware)
> # modular, capable of reusing the same cached values in conjunction with
> variant requests over the same result domain (this would support common use
> cases like paging, but also potentially more interesting direct uses of
> facets).
> # could be used for distributed refinement (i.e., if facet counts over a
> given domain are cached, a refinement request could simply look up the
> ordinal value for each enumerated term and directly grab the count out of the
> count array that was cached during the first phase of facet calculation)
> # composable (e.g., in aggregate functions that calculate values based on
> facet counts across different domains, like SKG/relatedness – see SOLR-13132)
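
The counting loop and the ordinal-based refinement lookup described in the quoted issue can be sketched as follows. This is a simplified Python illustration under stated assumptions, not Solr code: {{doc_ords}} (a per-document list of term ordinals), {{term_to_ord}}, {{count_terms}} and {{refine}} are all hypothetical names, and a real implementation would work per segment against the index's ordinal mappings.

```python
def count_terms(doc_ords, domain, num_terms):
    """Count term occurrences by visiting every value of every doc in the domain.

    doc_ords: list where doc_ords[doc_id] is a list of term ordinals (assumed layout).
    domain: iterable of doc ids in the result domain.
    """
    counts = [0] * num_terms
    for doc in domain:
        for term_ord in doc_ords[doc]:
            counts[term_ord] += 1
    return counts


def refine(terms, term_to_ord, cached_counts):
    """Answer a refinement request from a cached count array, with no recount.

    Terms unknown to this index get a count of 0.
    """
    return {
        t: cached_counts[term_to_ord[t]] if t in term_to_ord else 0
        for t in terms
    }


terms = ["apple", "banana", "cherry"]
term_to_ord = {t: i for i, t in enumerate(terms)}
# Five docs; doc 2 is multi-valued.
doc_ords = [[0], [1], [0, 2], [2], [0]]

# First phase: cost is proportional to the size of the domain.
counts = count_terms(doc_ords, range(len(doc_ords)), len(terms))
# Refinement phase: one array lookup per requested term.
refined = refine(["cherry", "apple", "durian"], term_to_ord, counts)
```

The first function does work proportional to the domain size on every call, which is the cost a cached count array would avoid; the second touches only the requested terms.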
--
This message was sent by Atlassian Jira
(v8.3.4#803005)