[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17153994#comment-17153994 ]
Michael Gibney commented on SOLR-13132:
---------------------------------------

I just attached [^SOLR-13132-benchmarks.tgz], some naive/sanity-check benchmarks. The results below are mostly without any filterCache, but I included hooks in the attached scripts to easily change the filterCache size for evaluation. For the purposes of this evaluation, fgSet == base search result domain. All results discussed here are for single-valued string fields, but multivalued string fields are also included in the benchmark attachment (results for multi-valued didn't differ substantially from those for single-valued).

There's a row for each docset domain recall percentage (percentage of \*:* domain returned by main query/fg), and a column for each field cardinality; cell values indicate latency (QTime) in ms against a single core with 3 million docs, no deletes; each value is the average of 10 repeated invocations of the relevant request (standard deviation isn't captured here, but was quite low, fwiw).

{code:java}
MASTER:
cdnlty:     10    100     1k    10k   100k     1m
.1%         23     32     75    124    118    109
1%          28     43     98    412   1057    582
10%         78     90    135    430   3291   5530
20%        139    151    192    484   3353   7610
30%        198    207    250    538   3462   9537
40%        253    265    307    595   3474  11206
50%        321    341    374    655   3497  13155
{code}

{code:java}
SOLR-13132 sweep_collection=false, filterCacheSize=0
cdnlty:     10    100     1k    10k   100k     1m
.1%         24     37     74    119    108     74
1%          29     44     97    403   1021    563
10%         77     92    136    417   3161   5356
20%        144    156    197    486   3233   7257
30%        199    209    254    534   3322   9224
40%        254    276    314    599   3393  10937
50%        323    352    368    643   3403  12718
{code}

{code:java}
SOLR-13132 sweep_collection=true
cdnlty:     10    100     1k    10k   100k     1m
.1%         99     99    106    111    114    145
1%         102    106    112    110    122    157
10%        168    173    177    175    193    241
20%        241    245    249    249    266    341
30%        307    312    318    313    337    409
40%        382    386    390    390    414    494
50%        449    455    459    460    487    569
{code}

{code:java}
SOLR-13132 sweep_collection=false, filterCacheSize=12000 (very large!)
cdnlty:     10    100     1k    10k   100k     1m
.1%         21     17     33     45     37     57
1%           7     17     41    270    885    525
10%         35     44     57    375   3239   5866
20%         71     78     89    204   3324   7800
30%         97    108    120    220   3335   9684
40%        130    141    152    248   3258  11582
50%        159    171    183    270   3299  13754
{code}

The last of these is with a _very_ large filterCache configured. It pretty clearly benefits field cardinalities through 10k, but has a slight negative impact at 100k and above (filterCache overhead with no benefit). This is as expected. It's worth noting that in real-world situations the filterCache is likely to be smaller and also used by other features, so this test probably underestimates the negative system-wide impact of an undersized filterCache, and also underestimates the filterCache size needed for a given field cardinality.

Because there were low-level changes to the code that collects counts, I also did a sanity check comparing simple facet term count performance (no skg) of master against SOLR-13132. I won't post the results inline here, but there appeared to be no change whatsoever. The results are included in the attached tar file (under the "results" directory).

Side note: the negative performance impact of sweep collection for low-cardinality docsets is mainly due to the fact that the full count accumulation domain (for sweep collection) becomes the union of the result DocSet, fg DocSet, and bg DocSet. Where the bg DocSet is large, as it often is (e.g., \*:*), we're essentially just seeing the effect of collecting term facet counts over a high-cardinality DocSet domain.
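To make the side note concrete, here is a toy sketch of count accumulation over the union of the three doc sets. It is only an illustration of why the number of docs visited tracks the largest set; it is not Solr's implementation, and the names ({{sweep_counts}}, {{term_of_doc}}) and the in-memory dict/set representation are assumptions made up for this example.

```python
from collections import Counter


def sweep_counts(term_of_doc, base, fg, bg):
    """Count terms for base, fg and bg in one pass over their union.

    term_of_doc: dict mapping doc id -> term (single-valued field).
    base, fg, bg: sets of doc ids (hypothetical stand-ins for DocSets).
    Returns (per-set Counters, number of docs visited).
    """
    counts = {"base": Counter(), "fg": Counter(), "bg": Counter()}
    domain = base | fg | bg  # accumulation domain is the union of all three
    for doc in domain:
        term = term_of_doc[doc]
        if doc in base:
            counts["base"][term] += 1
        if doc in fg:
            counts["fg"][term] += 1
        if doc in bg:
            counts["bg"][term] += 1
    return counts, len(domain)


# Toy index: 1000 docs, 10 distinct terms.
term_of_doc = {doc: f"t{doc % 10}" for doc in range(1000)}
all_docs = set(term_of_doc)
small = {0, 10, 20}  # a 0.3% result domain

# Same small result/fg domain, two different background sets.
_, visited_full_bg = sweep_counts(term_of_doc, small, small, all_docs)
_, visited_tiny_bg = sweep_counts(term_of_doc, small, small, {999})
print(visited_full_bg, visited_tiny_bg)  # 1000 4
```

With a match-all background set, the pass touches every doc no matter how small the result domain is; shrinking the background set shrinks the work.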
As a contrived example, setting an artificially restricted bgSet (e.g., {{\{!prefix f=id v=999999}}}) is considerably faster:

{code:java}
SOLR-13132 sweep_collection=true, tiny bgSet
cdnlty:     10    100     1k    10k   100k     1m
.1%          2      1      1      2      2      8
1%           4      6      6      6      9     17
10%         50     48     46     47     55     78
20%         91     92     93     93    104    146
30%        135    137    138    141    152    198
40%        180    182    185    186    198    253
50%        223    226    228    229    245    316
{code}

> Improve JSON "terms" facet performance when sorted by relatedness
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>   Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-benchmarks.tgz, SOLR-13132-with-cache-01.patch,
>                      SOLR-13132-with-cache.patch, SOLR-13132.patch,
>                      SOLR-13132_testSweep.patch
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate {{relatedness}} for every term.
>
> The current implementation uses a standard uninverted approach (either {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain base docSet, and then uses that initial pass as a pre-filter for a second-pass, inverted approach of fetching docSets for each relevant term (i.e., {{count > minCount}}?) and calculating intersection size of those sets with the domain base docSet.
>
> Over high-cardinality fields, the overhead of per-term docSet creation and set intersection operations increases request latency to the point where relatedness sort may not be usable in practice (for my use case, even after applying the patch for SOLR-13108, for a field with ~220k unique terms per core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 1816684=9000ms, cardinality 5032902=18000ms).
>
> The attached patch brings the above example QTimes down to a manageable ~300ms and ~250ms respectively. The approach calculates uninverted facet counts over domain base, foreground, and background docSets in parallel in a single pass. This allows us to take advantage of the efficiencies built into the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids the per-term docSet creation and set intersection overhead.
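For readers unfamiliar with the two-pass approach that the issue description sets out to replace, the following toy sketch shows why its cost grows with the number of distinct terms. It is a simplified model under stated assumptions, not Solr code: {{two_pass_counts}}, {{build}}, {{postings}} and the choice to intersect each term's doc set with the fg and bg sets are all hypothetical names and simplifications introduced here.

```python
from collections import Counter


def two_pass_counts(postings, term_of_doc, base, fg, bg, min_count=0):
    """Toy model of the two-pass approach.

    Pass 1: uninverted term counts over the base doc set.
    Pass 2: for each term with count > min_count, fetch the term's doc set
            and intersect it with the fg and bg sets.
    Returns ({term: (base_count, fg_count, bg_count)}, intersections done).
    """
    base_counts = Counter(term_of_doc[doc] for doc in base)
    results = {}
    intersections = 0
    for term, count in base_counts.items():
        if count <= min_count:
            continue
        term_docs = postings[term]  # per-term doc set (the costly part)
        results[term] = (count, len(term_docs & fg), len(term_docs & bg))
        intersections += 2
    return results, intersections


def build(num_docs, num_terms):
    """Build a toy single-valued field with num_terms distinct values."""
    term_of_doc = {d: f"t{d % num_terms}" for d in range(num_docs)}
    postings = {}
    for d, t in term_of_doc.items():
        postings.setdefault(t, set()).add(d)
    return term_of_doc, postings


base = set(range(500))  # 50% of a 1000-doc toy index
for num_terms in (10, 500):
    term_of_doc, postings = build(1000, num_terms)
    _, n = two_pass_counts(postings, term_of_doc, base, base, set(term_of_doc))
    print(num_terms, n)  # prints "10 20" then "500 1000"
```

The number of set intersections is proportional to the number of candidate terms, which is consistent with the latency growth the description reports for high-cardinality fields.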