[jira] [Commented] (SOLR-13807) Caching for term facet counts

Michael Gibney (Jira) Mon, 17 Aug 2020 14:29:22 -0700


    [ https://issues.apache.org/jira/browse/SOLR-13807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179234#comment-17179234 ]


Michael Gibney commented on SOLR-13807:
---------------------------------------

aba1d18797c99b46edf211ff26b989a4d23f625b adds the ability to configure the 
docset domain size threshold for consulting the termFacetCache 
({{countCacheDf}}) separately for the base domain, fgSet, and bgSet. This is 
practically useful for the common case where one might want to dedicate the 
termFacetCache primarily to caching (large, static) bgSets, but accumulate 
facet counts for the base domain and fgSet "normally" (i.e., without consulting 
the termFacetCache). One nice side effect is that this allows the benchmarks 
above to demonstrate, in a more isolated way, the different potential benefits 
of the termFacetCache with respect to sort-by-{{relatedness}}.
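
To make the per-role threshold concrete, here is a minimal sketch of the decision it controls. It is not the patch's code. The only semantics taken from this comment are that {{countCacheDf=-1}} disables cache consultation and that {{0}} is the default; the "at least this many docs" comparison, the function name, the role keys, and the domain sizes are all assumptions made for illustration.

```python
from typing import Dict

DISABLED = -1  # countCacheDf=-1: never consult the cache (per the benchmark labels)


def consults_cache(domain_size: int, count_cache_df: int) -> bool:
    """Assumed rule: consult the cache when the domain holds at least
    `count_cache_df` docs; a negative threshold disables consultation."""
    return count_cache_df >= 0 and domain_size >= count_cache_df


# Hypothetical per-role thresholds: dedicate the cache to the bgSet only.
thresholds: Dict[str, int] = {"base": DISABLED, "fg": DISABLED, "bg": 0}

# Made-up domain sizes, purely for illustration.
domain_sizes: Dict[str, int] = {"base": 1_200, "fg": 300, "bg": 1_000_000}

decisions = {
    role: consults_cache(domain_sizes[role], thresholds[role])
    for role in thresholds
}
print(decisions)
```

With these hypothetical settings only the bgSet consults the cache, which corresponds to the "bg-countCacheDf=0" run in the benchmarks.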

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # no cache/cache maxSize=0
cdnlty: 10      100     1k      10k     100k    1m
.1%     86      85      90      90      93      130
1%      89      87      92      91      96      138
10%     128     122     125     125     145     188
20%     145     137     142     143     162     242
30%     164     153     160     166     180     266
40%     186     176     180     180     201     295
50%     206     188     193     197     216     326
99.99%  198     196     200     200     220     332

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # all caching disabled at query level (countCacheDf=-1)
cdnlty: 10      100     1k      10k     100k    1m
.1%     88      86      91      89      94      121
1%      89      87      91      92      96      131
10%     124     122     127     126     146     186
20%     142     139     144     144     166     241
30%     160     156     162     160     213     264
40%     181     180     183     183     206     301
50%     196     195     197     196     216     324
99.99%  199     195     202     204     223     331

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # bg-countCacheDf=0 (default; cache size->6 -- 1 per field)
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       1       4
1%      1       1       1       2       5       8
10%     22      22      28      25      37      61
20%     42      42      43      44      59      99
30%     61      60      62      63      82      129
40%     80      80      84      84      101     165
50%     99      98      101     102     122     199
99.99%  122     118     123     124     142     231

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s true # base, fg, bg countCacheDf=0 (default; cache size->84)
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       1       1       9
1%      0       0       0       1       4       12
10%     3       3       3       4       13      38
20%     6       6       6       7       14      47
30%     9       8       8       9       17      56
40%     11      10      11      10      21      64
50%     14      13      13      14      21      75
99.99%  25      24      24      25      34      108
{code}

The first two runs above demonstrate parity between disabling caching at the 
collection config level and at the query level. The third illustrates the 
result of consulting the termFacetCache only for the bgSet. The last consults 
the termFacetCache across the board (base, fg, bg). One takeaway here is that 
even a _very_ modestly configured termFacetCache gives a ~10x performance boost 
for sort-by-skg against low-recall base domains, and a "worst"-case boost (for 
high-cardinality domains) of ~33% -- and because of the static nature of the 
bgSet, this boost can be expected to be consistent (unlike the hit-or-miss 
gains that are generally characteristic of caching).
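
The size of the boost can be checked directly against the tables. The sketch below copies three rows from the "no cache" and "bg-countCacheDf=0" runs above and computes the fractional reduction and the speedup ratio per cell. The function and variable names are illustrative only. Cells reported as 0 leave the ratio undefined, so it is returned as {{None}} rather than guessed.

```python
from typing import Dict, List, Optional, Tuple

# Rows copied from the sort-by-relatedness tables above.
# Columns: cardinality 10, 100, 1k, 10k, 100k, 1m.
NO_CACHE: Dict[str, List[int]] = {
    ".1%": [86, 85, 90, 90, 93, 130],
    "10%": [128, 122, 125, 125, 145, 188],
    "99.99%": [198, 196, 200, 200, 220, 332],
}
BG_ONLY: Dict[str, List[int]] = {
    ".1%": [0, 0, 0, 0, 1, 4],
    "10%": [22, 22, 28, 25, 37, 61],
    "99.99%": [122, 118, 123, 124, 142, 231],
}


def reduction(before: int, after: int) -> float:
    """Fractional reduction in the reported figure (0.30 means 30% lower)."""
    return (before - after) / before


def speedup(before: int, after: int) -> Optional[float]:
    """Ratio before/after, or None when the cached figure is reported as 0."""
    return before / after if after > 0 else None


def summarize(row: str) -> List[Tuple[float, Optional[float]]]:
    """Per-column (reduction, speedup) pairs for one recall row."""
    return [
        (reduction(b, a), speedup(b, a))
        for b, a in zip(NO_CACHE[row], BG_ONLY[row])
    ]


for row in NO_CACHE:
    cells = ", ".join(f"{r:.0%}" for r, _ in summarize(row))
    print(f"{row:>7}: {cells}")
```

For the 99.99% row at 1m cardinality, the figure drops from 332 to 231, a reduction of roughly 30%. That is about the size of the "worst"-case boost quoted above.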

The first two count-only benchmarks (below) demonstrate analogous parity for 
different ways of disabling caching, and the third shows a significant 
performance boost for count-only (sort-by-count) with default termFacetCache 
consultation.

{code}
[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # no cache/maxSize=0
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       4
1%      1       1       1       1       1       5
10%     9       8       9       8       11      15
20%     16      15      17      18      20      32
30%     23      21      23      22      26      42
40%     29      28      31      29      34      54
50%     37      34      37      35      41      65
99.99%  70      63      70      67      74      112

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # caching disabled at query time (countCacheDf=-1)
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       4
1%      1       1       1       1       2       6
10%     7       8       8       9       11      14
20%     16      15      17      17      20      32
30%     23      21      22      21      27      42
40%     29      27      31      28      33      52
50%     35      33      39      35      43      63
99.99%  69      64      71      67      76      110

[magibney@mbp SOLR-13807-benchmarks]$ ./check.sh s count # countCacheDf=0 (default behavior; cache size->42)
cdnlty: 10      100     1k      10k     100k    1m
.1%     0       0       0       0       0       3
1%      0       0       0       0       1       4
10%     2       3       2       3       4       10
20%     6       6       6       6       7       13
30%     9       8       8       9       9       16
40%     12      11      11      12      13      19
50%     14      14      13      14      15      21
99.99%  27      25      26      25      27      35
{code}

> Caching for term facet counts
> -----------------------------
>
>                 Key: SOLR-13807
>                 URL: https://issues.apache.org/jira/browse/SOLR-13807
>             Project: Solr
>          Issue Type: New Feature
>          Components: Facet Module
>    Affects Versions: master (9.0), 8.2
>            Reporter: Michael Gibney
>            Priority: Minor
>         Attachments: SOLR-13807-benchmarks.tgz, 
> SOLR-13807__SOLR-13132_test_stub.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Solr does not have a facet count cache; so for _every_ request, term facets 
> are recalculated for _every_ (facet) field, by iterating over _every_ field 
> value for _every_ doc in the result domain, and incrementing the associated 
> count.
> As a result, subsequent requests end up redoing a lot of the same work, 
> including all associated object allocation, GC, etc. This situation could 
> benefit from integrated caching.
> Because of the domain-based, serial/iterative nature of term facet 
> calculation, latency is proportional to the size of the result domain. 
> Consequently, one common/clear manifestation of this issue is high latency 
> for faceting over an unrestricted domain (e.g., {{\*:\*}}), as might be 
> observed on a top-level landing page that exposes facets. This type of 
> "static" case is often mitigated by external (to Solr) caching, either with a 
> caching layer between Solr and a front-end application, or within a front-end 
> application, or even with a caching layer between the end user and a 
> front-end application.
> But in addition to the overhead of handling this caching elsewhere in the 
> stack (or, for a new user, even being aware of this as a potential issue to 
> mitigate), any external caching mitigation is really only appropriate for 
> relatively static cases like the "landing page" example described above. A 
> Solr-internal facet count cache (analogous to the {{filterCache}}) would 
> provide the following additional benefits:
>  # ease of use/out-of-the-box configuration to address a common performance 
> concern
>  # compact (specifically caching count arrays, without the extra baggage that 
> accompanies a naive external caching approach)
>  # NRT-friendly (could be implemented to be segment-aware)
>  # modular, capable of reusing the same cached values in conjunction with 
> variant requests over the same result domain (this would support common use 
> cases like paging, but also potentially more interesting direct uses of 
> facets). 
>  # could be used for distributed refinement (i.e., if facet counts over a 
> given domain are cached, a refinement request could simply look up the 
> ordinal value for each enumerated term and directly grab the count out of the 
> count array that was cached during the first phase of facet calculation)
>  # composable (e.g., in aggregate functions that calculate values based on 
> facet counts across different domains, like SKG/relatedness – see SOLR-13132)
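
The refinement point in the description (look up a term's ordinal, then read its count from the cached array) can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not Solr's implementation: the class, its fields, and the in-memory "index" are all hypothetical, and real domains would be DocSets rather than Python frozensets.

```python
from typing import Dict, FrozenSet, List, Tuple


class TermFacetCountCache:
    """Toy cache mapping (field, domain) to a count array indexed by term ordinal."""

    def __init__(
        self,
        term_ords: Dict[str, Dict[str, int]],
        doc_values: Dict[str, Dict[int, List[int]]],
    ) -> None:
        self.term_ords = term_ords    # field -> term -> ordinal
        self.doc_values = doc_values  # field -> doc id -> ordinals present in doc
        self._cache: Dict[Tuple[str, FrozenSet[int]], List[int]] = {}
        self.misses = 0

    def counts(self, field: str, domain: FrozenSet[int]) -> List[int]:
        """Return the count array, computing it only on a cache miss."""
        key = (field, domain)
        cached = self._cache.get(key)
        if cached is None:
            self.misses += 1
            cached = [0] * len(self.term_ords[field])
            # The uncached path: visit every value of every doc in the domain.
            for doc in domain:
                for ordinal in self.doc_values[field].get(doc, []):
                    cached[ordinal] += 1
            self._cache[key] = cached
        return cached

    def refine(self, field: str, domain: FrozenSet[int], term: str) -> int:
        """Refinement: one ordinal lookup plus one array read."""
        return self.counts(field, domain)[self.term_ords[field][term]]


cache = TermFacetCountCache(
    term_ords={"color": {"blue": 0, "red": 1}},
    doc_values={"color": {1: [0], 2: [1], 3: [1]}},
)
domain = frozenset({1, 2, 3})
first_phase = cache.counts("color", domain)  # computed by iterating the domain
red_count = cache.refine("color", domain, "red")  # served from the cached array
print(first_phase, red_count, cache.misses)
```

The second call does not iterate the domain again, so {{misses}} stays at 1. The same cached array could serve paging or other variant requests over the same domain.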



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
