[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153994#comment-17153994 ]

Michael Gibney commented on SOLR-13132:
---------------------------------------

I just attached [^SOLR-13132-benchmarks.tgz], some naive/sanity-check 
benchmarks. The results below are mostly without any filterCache, but I 
included hooks in the included scripts to easily change the filterCache size 
for evaluation. For the purposes of this evaluation, fgSet == base search 
result domain. All results discussed here are for single-valued string fields; 
multi-valued string fields are also included in the benchmark attachment 
(results for multi-valued didn't differ substantially from those for 
single-valued).

There's a row for each docset domain recall percentage (percentage of the \*:* 
domain returned by the main query/fg), and a column for each field cardinality; 
cell values indicate latency (QTime) in ms against a single core with 3 million 
docs, no deletes. Each value is the average of 10 repeated invocations of the 
relevant request (standard deviation isn't captured here, but was quite low, 
fwiw).
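For concreteness, the kind of request being benchmarked looks roughly like the 
following (a sketch, not copied from the attached scripts: field names, the 
recall-bucket filter, and the {{fore}}/{{back}} param names are illustrative; 
{{relatedness($fore,$back)}} is the standard JSON Facet syntax, and here 
fore == the main query domain, per the fgSet == base domain convention above):
{code:java}
{
  "query": "recall_bucket_s:p10",
  "params": {
    "fore": "recall_bucket_s:p10",
    "back": "*:*"
  },
  "facet": {
    "skg_terms": {
      "type": "terms",
      "field": "cardinality_1m_s",
      "limit": 10,
      "sort": { "skg": "desc" },
      "facet": {
        "skg": "relatedness($fore,$back)"
      }
    }
  }
}
{code}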
{code:java}
MASTER:
cdnlty: 10      100     1k      10k     100k    1m
.1%     23      32      75      124     118     109
1%      28      43      98      412     1057    582
10%     78      90      135     430     3291    5530
20%     139     151     192     484     3353    7610
30%     198     207     250     538     3462    9537
40%     253     265     307     595     3474    11206
50%     321     341     374     655     3497    13155
{code}
{code:java}
SOLR-13132 sweep_collection=false, filterCacheSize=0
cdnlty: 10      100     1k      10k     100k    1m
.1%     24      37      74      119     108     74
1%      29      44      97      403     1021    563
10%     77      92      136     417     3161    5356
20%     144     156     197     486     3233    7257
30%     199     209     254     534     3322    9224
40%     254     276     314     599     3393    10937
50%     323     352     368     643     3403    12718
{code}
{code:java}
SOLR-13132 sweep_collection=true
cdnlty: 10      100     1k      10k     100k    1m
.1%     99      99      106     111     114     145
1%      102     106     112     110     122     157
10%     168     173     177     175     193     241
20%     241     245     249     249     266     341
30%     307     312     318     313     337     409
40%     382     386     390     390     414     494
50%     449     455     459     460     487     569
{code}
{code:java}
SOLR-13132 sweep_collection=false, filterCacheSize=12000 (very large!)
cdnlty: 10      100     1k      10k     100k    1m
.1%     21      17      33      45      37      57
1%      7       17      41      270     885     525
10%     35      44      57      375     3239    5866
20%     71      78      89      204     3324    7800
30%     97      108     120     220     3335    9684
40%     130     141     152     248     3258    11582
50%     159     171     183     270     3299    13754
{code}
The last of these is with a _very_ large filterCache configured. It pretty 
clearly benefits cardinalities up through 10k, but has a slight negative impact 
on 100k and above (filterCache overhead with no benefit). This is as expected; 
it's worth noting that in real-world situations the filterCache is likely to be 
smaller and also shared with other features, so this test probably 
underestimates the negative system-wide impact of an undersized filterCache, 
and also underestimates the filterCache size threshold that would be sufficient 
for a given field cardinality.
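For reference, the oversized cache in that last run corresponds to a 
solrconfig.xml entry along these lines (a sketch; the exact cache class and 
attributes in the attached configs may differ, but these are the standard 
solrconfig.xml filterCache attributes):
{code:java}
<!-- Deliberately oversized filterCache for the benchmark run above;
     real deployments would typically use a much smaller size, shared
     with fq caching and other features. -->
<filterCache class="solr.CaffeineCache"
             size="12000"
             initialSize="512"
             autowarmCount="0"/>
{code}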

Because there were low-level changes to the count-collection code, I also did 
a sanity check comparing simple facet term count performance (no skg) between 
master and SOLR-13132. I won't post the results inline here, but there appeared 
to be no change whatsoever. The results are included in the attached tar file 
(under the "results" directory).

Side note: the negative performance impact of sweep collection for 
low-cardinality docsets is mainly due to the fact that the full count 
accumulation domain (for sweep collection) becomes the union of the result 
DocSet, fg DocSet, and bg DocSet. Since the bg DocSet is often large (e.g., 
\*:*), we're essentially just seeing the effect of collecting term facet counts 
over a high-cardinality DocSet domain. As a contrived example, setting an 
artificially restricted bgSet (e.g., {{\{!prefix f=id v=999999}}}) is 
considerably faster:
{code:java}
SOLR-13132 sweep_collection=true, tiny bgSet
cdnlty: 10      100     1k      10k     100k    1m
.1%     2       1       1       2       2       8
1%      4       6       6       6       9       17
10%     50      48      46      47      55      78
20%     91      92      93      93      104     146
30%     135     137     138     141     152     198
40%     180     182     185     186     198     253
50%     223     226     228     229     245     316
{code}
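In request terms, that contrived restriction just swaps the background query 
param (again a sketch; param names follow the standard {{fore}}/{{back}} 
convention, and the foreground query shown is illustrative):
{code:java}
{
  "params": {
    "fore": "recall_bucket_s:p10",
    "back": "{!prefix f=id v=999999}"
  }
}
{code}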

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-benchmarks.tgz, 
> SOLR-13132-with-cache-01.patch, SOLR-13132-with-cache.patch, 
> SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
