[
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17136007#comment-17136007
]
Michael Gibney commented on SOLR-13132:
---------------------------------------
I just pushed some commits that should address many of the outstanding
nocommits (mostly some minor refactoring and added javadocs).
The thorniest remaining issue, I think, is caching (when to consult the
filterCache for non-sweep collection). Previously, every bucket-based query
(which, pre-sweep, was "all queries") consulted the filterCache – a serious
problem for "terms" facets over high-cardinality fields. The code to address
this in {{RelatednessAgg}} is carried over from work on SOLR-13108, roughly
adapting the way the (undocumented?) {{cacheDf}} parameter is respected in
{{FacetFieldProcessorByEnumTermsStream}}.
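To illustrate the general idea of a df-gated cache, here is a minimal sketch. It is not Solr code: the class {{DfThresholdCache}}, its fields, and the {{loadPostings}} callback are all hypothetical names, and the threshold is just a constructor argument rather than anything derived from {{cacheDf}}. Terms whose document frequency falls below the threshold bypass the cache entirely, so rare terms from a high-cardinality field cannot flood it.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch (not Solr code): only consult the cache for terms with df >= threshold. */
final class DfThresholdCache {
  private final Map<String, int[]> cache = new HashMap<>();
  private final int minDfCached; // assumed threshold; stands in for a cacheDf-style setting
  int cacheLookups = 0;          // counts how often the cache was consulted at all

  DfThresholdCache(int minDfCached) {
    this.minDfCached = minDfCached;
  }

  /** Returns the doc ids for a term, caching them only if the term is frequent enough. */
  int[] docsFor(String term, int df, Function<String, int[]> loadPostings) {
    if (df < minDfCached) {
      // Rare term: read postings directly and leave the cache untouched.
      return loadPostings.apply(term);
    }
    cacheLookups++;
    return cache.computeIfAbsent(term, loadPostings);
  }

  int cachedEntries() {
    return cache.size();
  }
}
```

With a threshold of 3, a term with df=1 is loaded directly on every call, while a term with df=4 is loaded once and served from the cache afterwards.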
The approach currently in {{RelatednessAgg}} was ported as closely as possible
from the implementation in {{FacetFieldProcessorByEnumTermsStream}}, but
differs by necessity (right?) in that the latter can use the
{{DocsEnumState.minSetSizeCached}} over a single "slowAtomicReader()"-backed
{{TermsEnum}}, whereas in {{RelatednessAgg}}, the terms may arrive out of
order. The heuristic-based approach implemented in {{RelatednessAgg}} results
from my assumption that forward-only {{TermsEnum}} would make the
{{DocsEnumState}} approach a non-starter in the {{RelatednessAgg}} context. If
I'm wrong or missing something here, that would be great, since I definitely
would have preferred that approach, all else being equal! If on the other hand
this assumption is valid, I can think of two possibilities:
# Stick with caching everything. This would still be problematic for
_non-sweep_ collection, but sweep collection should "solve" the problem by
rendering it irrelevant for the default case. This would still be sub-optimal
for refinement, but would probably be something we could deal with. In any
case, this wouldn't make anything _worse_. Alternatively, we could _never_
consult the filterCache, at least for {{TermQuery}}/{{FacetField}} and/or
refinement requests?
# Defer {{SKGSlotAcc.processSlot(...)}} until the end of the "collect" phase,
before reading values (via {{SlotAcc.setValues(...)}}). Collection would
"register" terms, which could be processed in a single index-order pass backed
by a single {{TermsEnum}}. This would probably be out of scope for this issue,
and I'm not sure it would work for, e.g., {{FacetFieldProcessorByHashDV}}, but
I figured I'd mention it here anyway...
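The second option can be sketched roughly as follows. This is only an illustration under stated assumptions, not a proposed patch: {{DeferredSlotProcessing}}, {{register}} and {{processAll}} are hypothetical names, a {{TreeMap}} iterator stands in for a forward-only {{TermsEnum}}, and plain {{String}} ordering stands in for index term order. Terms are registered in whatever order collection sees them, then sorted once and resolved in a single forward pass.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch (not Solr code): defer per-slot work, then do one sorted forward pass. */
final class DeferredSlotProcessing {
  /** A (slot, term) pair registered during collection. */
  private static final class Pending {
    final int slot;
    final String term;

    Pending(int slot, String term) {
      this.slot = slot;
      this.term = term;
    }
  }

  private final List<Pending> pending = new ArrayList<>();

  /** Called during collection; terms may arrive in any order. */
  void register(int slot, String term) {
    pending.add(new Pending(slot, term));
  }

  /**
   * Resolves every registered term in one forward-only pass over the sorted
   * term dictionary, writing each term's doc frequency into its slot.
   * Terms missing from the dictionary leave their slot at 0.
   */
  int[] processAll(TreeMap<String, Integer> termDict, int numSlots) {
    int[] dfBySlot = new int[numSlots];
    pending.sort(Comparator.comparing((Pending p) -> p.term)); // stand-in for index order
    Iterator<Map.Entry<String, Integer>> cursor = termDict.entrySet().iterator();
    Map.Entry<String, Integer> current = cursor.hasNext() ? cursor.next() : null;
    for (Pending p : pending) {
      // The cursor only ever advances, as a forward-only enum would.
      while (current != null && current.getKey().compareTo(p.term) < 0) {
        current = cursor.hasNext() ? cursor.next() : null;
      }
      if (current != null && current.getKey().equals(p.term)) {
        dfBySlot[p.slot] = current.getValue();
      }
    }
    pending.clear();
    return dfBySlot;
  }
}
```

The cost is an extra sort over the registered terms plus holding them in memory until the end of collection; whether that fits processors that never see term ordinals in index order is exactly the open question above.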
> Improve JSON "terms" facet performance when sorted by relatedness
> ------------------------------------------------------------------
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Affects Versions: 7.4, master (9.0)
> Reporter: Michael Gibney
> Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch,
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate
> {{relatedness}} for every term.
> The current implementation uses a standard uninverted approach (either
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain
> base docSet, and then uses that initial pass as a pre-filter for a
> second-pass, inverted approach of fetching docSets for each relevant term
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and
> set intersection operations increases request latency to the point where
> relatedness sort may not be usable in practice (for my use case, even after
> applying the patch for SOLR-13108, for a field with ~220k unique terms per
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable
> ~300ms and ~250ms respectively. The approach calculates uninverted facet
> counts over domain base, foreground, and background docSets in parallel in a
> single pass. This allows us to take advantage of the efficiencies built into
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids
> the per-term docSet creation and set intersection overhead.
--
This message was sent by Atlassian Jira
(v8.3.4#803005)