[ 
https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175836#comment-17175836
 ] 

Michael Gibney commented on SOLR-13807:
---------------------------------------

After SOLR-13132 was merged to master, it was a bit of a challenge to reconcile 
it with the complementary "term facet cache" (this issue). I've taken an initial 
stab at this and pushed to [PR 
#1357|https://github.com/apache/lucene-solr/pull/1357], and I think it's once 
again ready for consideration.

Below are some naive performance benchmarks, using [^SOLR-13807-benchmarks.tgz] 
(based on similar benchmarks for SOLR-13132).

{{filterCache}} is irrelevant for what's illustrated here (all count or sweep 
collection; single-shard, thus no refinement). The attached scripts include 
hooks to easily change the filterCache size and termFacetCache size for 
evaluation. For the purpose of {{relatedness}} evaluation, fgSet == base search 
result domain. All results discussed here are for single-valued string fields, 
but multivalued string fields are also included in the benchmark attachment 
(results for multi-valued didn't differ substantially from those for 
single-valued).

There's a row for each docset domain recall percentage (percentage of \*:* 
domain returned by main query/fg), and a column for each field cardinality. 
Cell values indicate latency (QTime) in ms against a single core with 3 million 
docs, no deletes. Each value is the average of 10 repeated invocations of the 
relevant request (standard deviation isn't captured here, but was quite low, 
fwiw).

Below are for current (including SOLR-13132) master; no caches (filterCache, if 
present, would be unused):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, master
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       4
1%      1       0       1       1       2       5
10%     7       7       8       8       10      16
20%     17      14      16      15      19      31
30%     22      19      23      20      24      42
40%     27      26      28      28      32      50
50%     33      32      35      32      38      59
99.99%  65      60      67      62      72      107

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, master
cdnlty: 10      100     1k      10k     100k    1m
.1%     179     174     183     190     192     225
1%      182     177     186     183     194     236
10%     193     191     196     197     226     256
20%     206     200     207     207     234     300
30%     216     210     217     216     239     316
40%     228     225     231     231     253     331
50%     239     234     241     240     266     347
99.99%  285     280     287     287     311     403
{code}

Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with _no_ 
termFacetCache configured (apples-to-apples, since there are changes in some of 
the hot facet code paths):
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, no_cache
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       3
1%      1       1       1       1       1       6
10%     8       8       9       8       11      14
20%     16      15      16      15      20      32
30%     21      21      23      22      26      42
40%     28      27      31      28      34      53
50%     35      33      37      34      40      63
99.99%  68      64      71      66      74      108

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, no_cache
cdnlty: 10      100     1k      10k     100k    1m
.1%     96      80      89      97      96      129
1%      88      83      90      88      101     133
10%     99      97      103     102     122     162
20%     117     107     113     113     135     194
30%     120     117     123     122     144     211
40%     130     129     134     134     156     232
50%     143     140     147     144     169     249
99.99%  179     175     181     179     201     305
{code}

Below are for 77daac4ae2a4d1c40652eafbbdb42b582fe2d02d (SOLR-13807), with 
{{solr.termFacetCacheSize=20}} configured.
{code}
[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s count # sort-by-count, cache size 20
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       2
1%      0       0       0       0       1       10
10%     3       4       4       4       5       16
20%     8       7       8       7       9       20
30%     11      10      12      11      13      25
40%     13      13      15      15      15      28
50%     15      16      16      18      20      32
99.99%  29      30      30      29      32      45

[magibney@mbp SOLR-13132-benchmarks]$ ./check.sh s true # sort-by-skg, cache size 20
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       1       6
1%      0       0       0       1       4       14
10%     3       4       4       5       11      33
20%     9       8       8       8       16      41
30%     10      10      11      12      17      51
40%     13      13      13      14      20      61
50%     16      15      17      17      23      69
99.99%  30      28      30      30      37      101
{code}

The performance boost for sort-by-count has all the normal caveats of any type 
of caching, but could result in huge practical performance benefits for "main 
index page" and/or paging requests that use facets.

The performance boost for sort-by-skg, on the other hand, in many cases even 
transcends normal caching caveats (assuming sweep collection and a relatively 
static "background set"). With sweep collection, the common-case background set 
of \*:*, e.g., would be cached and used repeatedly even with a minimal 
termFacetCache (say, size=10), making for an uncharacteristically consistent 
cache boost (a good thing!).
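
To make the background-set point concrete, here is a minimal sketch of the idea, under stated assumptions. It is not the code in the PR: the class name {{TermFacetCacheSketch}}, the string cache key, and the toy data are all hypothetical, and a plain LRU map stands in for whatever cache implementation Solr actually uses. It only shows why a count array keyed on (field, domain) for \*:* keeps getting reused while foreground entries churn.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch only; names and structure are not taken from the SOLR-13807 patch.
class TermFacetCacheSketch {
    private final LinkedHashMap<String, int[]> lru;
    int hits = 0;
    int misses = 0;

    TermFacetCacheSketch(final int maxSize) {
        // accessOrder=true: each get() moves the entry to the most-recently-used end.
        this.lru = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxSize; // evict the least-recently-used entry
            }
        };
    }

    // Returns the per-term count array for (field, domain), computing it only on a miss.
    int[] counts(String field, String domain, Supplier<int[]> compute) {
        String key = field + "\t" + domain; // assumed key format, for illustration only
        int[] cached = lru.get(key);
        if (cached != null) {
            hits++;
            return cached;
        }
        misses++;
        int[] fresh = compute.get();
        lru.put(key, fresh);
        return fresh;
    }

    // Uncached counting over a single-valued field: one increment per doc in the domain.
    static int[] countTerms(int[] ordByDoc, int[] domainDocs, int cardinality) {
        int[] counts = new int[cardinality];
        for (int doc : domainDocs) {
            counts[ordByDoc[doc]]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int[] ordByDoc = {0, 1, 1, 2, 0, 1};  // term ordinal for each of 6 toy docs
        int[] allDocs = {0, 1, 2, 3, 4, 5};   // background set, i.e. *:*
        int[][] foregrounds = {{0, 1}, {2, 3}, {4, 5}, {1, 4}, {0, 5}};
        TermFacetCacheSketch cache = new TermFacetCacheSketch(2);
        for (int i = 0; i < foregrounds.length; i++) {
            final int[] fg = foregrounds[i];
            // Each request needs foreground counts (different every time) ...
            cache.counts("cat", "fg" + i, () -> countTerms(ordByDoc, fg, 3));
            // ... and background counts (always the same domain).
            cache.counts("cat", "*:*", () -> countTerms(ordByDoc, allDocs, 3));
        }
        System.out.println("hits=" + cache.hits + " misses=" + cache.misses);
    }
}
```

With a cache size of 2 and five distinct foreground domains, {{main}} prints {{hits=4 misses=6}}: every foreground lookup misses, but the \*:* entry is computed once and then hit on each later request, because each hit refreshes its position in the LRU order.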

Note that performance of "sort-by-skg" with termFacetCache is comparable to the 
performance of simple sort-by-count pre-termFacetCache, and consistent across 
field and domain cardinalities.

> Caching for term facet counts
> -----------------------------
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807-benchmarks.tgz, 
> SOLR-13807__SOLR-13132_test_stub.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets 
> are recalculated for _every_ (facet) field, by iterating over _every_ field 
> value for _every_ doc in the result domain, and incrementing the associated 
> count.
> As a result, subsequent requests end up redoing a lot of the same work, 
> including all associated object allocation, GC, etc. This situation could 
> benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet 
> calculation, latency is proportional to the size of the result domain. 
> Consequently, one common/clear manifestation of this issue is high latency 
> for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be 
> observed on a top-level landing page that exposes facets. This type of 
> "static" case is often mitigated by external (to Solr) caching, either with a 
> caching layer between Solr and a front-end application, or within a front-end 
> application, or even with a caching layer between the end user and a 
> front-end application.
> But in addition to the overhead of handling this caching elsewhere in the 
> stack (or, for a new user, even being aware of this as a potential issue to 
> mitigate), any external caching mitigation is really only appropriate for 
> relatively static cases like the "landing page" example described above. A 
> Solr-internal facet count cache (analogous to the {{filterCache}}) would 
> provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance 
> concern
>  # compact (specifically caching count arrays, without the extra baggage that 
> accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with 
> variant requests over the same result domain (this would support common use 
> cases like paging, but also potentially more interesting direct uses of 
> facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a 
> given domain are cached, a refinement request could simply look up the 
> ordinal value for each enumerated term and directly grab the count out of the 
> count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on 
> facet counts across different domains, like SKG/relatedness – see SOLR-13132)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)