[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (Jira) Thu, 09 Jul 2020 08:20:18 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17154661#comment-17154661
 ]


Michael Gibney commented on SOLR-13132:
---------------------------------------

Sorry, yes; the "MASTER" results were for "filterCacheSize=0", so 
apples-to-apples with "SOLR-13132 sweep_collection=false, filterCacheSize=0". 
And yes, I'll update the ref guide shortly.

bq.Assuming i'm understanding correctly...

Yes, that's my takeaway as well.

My only remaining questions are what counts as a "common" vs. "uncommon" 
case and, for the negative impact of sweep collection on low-cardinality 
fields, what impact we consider "small". To explore this a little: I think 
it's hard to say which use case is the common one. But the worst-case 
negative impact of sweep collection (disregarding filterCache) is ~4x, for 
very-low-cardinality fields over low-recall FG sets, which are likely among 
the fastest queries in an absolute sense. This seems acceptable to me.

Considering the performance boost that filterCache can in some cases provide to 
non-sweep collection, the worst-case negative performance impact can go to 
~100x ... _but_ I still think that's ok, because it makes sense to consider 
reliance on filterCache as an opt-in performance optimization (analogous to how 
the {{enum}} facet method can outperform {{dv}} faceting for low-cardinality 
fields and a sufficiently-sized filterCache). Relying on filterCache in these 
cases can yield significant performance benefits, but is very 
situation-specific, and should be approached carefully to avoid system-wide 
negative effects. So, particularly pending some way to make filterCache use 
more selective (e.g., SOLR-13108), it makes sense to default to sweep 
collection _even if only_ because it avoids accidental filterCache thrashing.
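
For concreteness, here is a rough sketch of what opting out of sweep collection might look like in a JSON Facet request, written as a Python dict. The option name {{sweep_collection}} is taken from the benchmark labels above; its placement inside the extended {{relatedness}} function syntax is my assumption, and the field name {{category_s}}, the bucket name {{related_terms}}, and the fore/back queries are hypothetical.

```python
import json


def relatedness_terms_facet(field, fore_query, back_query, sweep=True, limit=10):
    """Build a JSON Facet request body that sorts term buckets by relatedness."""
    relatedness = {
        "type": "func",
        "func": "relatedness($fore,$back)",
        # Assumption: option name from this thread; placement here is a guess.
        "sweep_collection": sweep,
    }
    return {
        "query": "*:*",
        # $fore and $back are resolved from these request params.
        "params": {"fore": fore_query, "back": back_query},
        "facet": {
            "related_terms": {  # hypothetical bucket name
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": {"r": "desc"},
                "facet": {"r": relatedness},
            }
        },
    }


# Hypothetical field and queries; sweep=False opts in to the filterCache-backed path.
body = relatedness_terms_facet("category_s", "body_t:solr", "*:*", sweep=False)
print(json.dumps(body, indent=2))
```

Leaving {{sweep}} at its default would correspond to the sweep-by-default behavior argued for above; turning it off is the opt-in case that relies on a suitably sized filterCache.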

... all that being a long way of saying "yes, I think we're good to go". Now 
I'll go transform that into something refGuide-appropriate.


> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-benchmarks.tgz, 
> SOLR-13132-with-cache-01.patch, SOLR-13132-with-cache.patch, 
> SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684 = 9000ms, cardinality 5032902 = 18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.
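
The contrast described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not the patch's implementation: docSets are plain Python sets of doc ids, {{postings}} and {{doc_terms}} are hypothetical stand-ins for the inverted index and the uninverted (docValues-like) view, and the relatedness score itself is not computed, only the three per-term counts it would be derived from.

```python
from collections import Counter


def per_term_counts(postings, base, fore, back):
    """Inverted approach: build and intersect a docSet for every relevant term."""
    counts = {}
    for term, docs in postings.items():
        if docs & base:  # pre-filter: keep only terms that occur in the base docSet
            counts[term] = (len(docs & base), len(docs & fore), len(docs & back))
    return counts


def sweep_counts(doc_terms, base, fore, back):
    """Uninverted approach: visit each document once, updating three counters."""
    base_c, fore_c, back_c = Counter(), Counter(), Counter()
    for doc in base | fore | back:
        in_base, in_fore, in_back = doc in base, doc in fore, doc in back
        for term in doc_terms.get(doc, ()):
            if in_base:
                base_c[term] += 1
            if in_fore:
                fore_c[term] += 1
            if in_back:
                back_c[term] += 1
    # Report only terms seen in the base docSet, matching the pre-filter above.
    return {t: (base_c[t], fore_c[t], back_c[t]) for t in base_c}


# Toy data: doc id -> terms, plus the equivalent inverted view.
doc_terms = {0: ["a", "b"], 1: ["a"], 2: ["b", "c"], 3: ["c"]}
postings = {}
for doc, terms in doc_terms.items():
    for term in terms:
        postings.setdefault(term, set()).add(doc)

base, fore, back = {0, 1, 2}, {0, 1}, {0, 1, 2, 3}
swept = sweep_counts(doc_terms, base, fore, back)
inverted = per_term_counts(postings, base, fore, back)
print(swept == inverted)
```

Both functions produce the same (base, foreground, background) counts per term. The first does set work proportional to the number of terms, which is what hurts on high-cardinality fields; the second does one pass over the documents regardless of how many distinct terms there are.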



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

