[
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17128463#comment-17128463
]
Michael Gibney commented on SOLR-13132:
---------------------------------------
Regarding a6b1c60e61563535d7ba67c17d74f2bada6f80a2: the {{ClassCastException}}
was what motivated this change, but the solution was misguided (for all the
reasons you've identified, and possibly more...). In any case, I think the hard
cast would have been a problem, since subclasses attempting to disable sweep
collection by setting a non-sweep {{countAcc}} would have hit a
{{ClassCastException}} on the hard cast anyway!
I just pushed a commit (0f2c01987e36e24179b192ca8373ab52498d75e5) that takes
what I think is close to the approach you're suggesting: selectively
wrapping {{countAcc}} to shim the mismatch between sweep and non-sweep code,
while respecting the sweep/non-sweep preference implicitly asserted by the
sweep compatibility of {{countAcc}}.
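To illustrate the wrapping idea in general terms, here is a minimal sketch. It is not the code from the commit: {{CountAcc}}, {{SweepCountAcc}}, {{CollectTarget}} and {{shim(...)}} are hypothetical stand-ins invented for this example, not the actual Solr classes. The point is only that an {{instanceof}} check plus a thin wrapper lets the collection code run against either kind of {{countAcc}} without a hard cast, and that a non-sweep {{countAcc}} is read as a request to disable sweep.

```java
public class CountAccShimSketch {

    /** Hypothetical stand-in for a plain, non-sweep count accumulator. */
    static class CountAcc {
        final int[] counts;
        CountAcc(int slots) { counts = new int[slots]; }
        void incrementCount(int slot, int delta) { counts[slot] += delta; }
        int getCount(int slot) { return counts[slot]; }
    }

    /** Hypothetical sweep-capable accumulator (details omitted in this sketch). */
    static class SweepCountAcc extends CountAcc {
        SweepCountAcc(int slots) { super(slots); }
    }

    /** What the collection loop sees, whichever kind of countAcc was set. */
    interface CollectTarget {
        boolean sweepEnabled();
        void collect(int slot);
    }

    /** Wrap countAcc instead of hard-casting it; its type decides sweep vs non-sweep. */
    static CollectTarget shim(CountAcc countAcc) {
        if (countAcc instanceof SweepCountAcc) {
            SweepCountAcc sweep = (SweepCountAcc) countAcc; // safe: guarded by instanceof
            return new CollectTarget() {
                public boolean sweepEnabled() { return true; }
                public void collect(int slot) { sweep.incrementCount(slot, 1); }
            };
        }
        // A non-sweep countAcc is treated as an implicit request to disable sweep.
        return new CollectTarget() {
            public boolean sweepEnabled() { return false; }
            public void collect(int slot) { countAcc.incrementCount(slot, 1); }
        };
    }
}
```

With this shape, a subclass that sets a plain {{CountAcc}} gets non-sweep collection rather than a {{ClassCastException}}.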
Regarding 61befab60696dc4267ab9c96e36bc266c93a2fc3: Yes, I'm still considering
this as well. Another possibility, possibly (?) no more invasive than adding a
{{SlotContext}} param to {{incrementCount(...)}}, would be to add a param
({{SlotContext}}, {{boolean}}?) to {{setValues(...)}} allowing {{allBuckets}}
to be skipped on _output_. Alternatively, for sweep collection, a
case could be made that relatedness output is meaningful in a way that it
isn't/can't be for non-sweep collection. So _if_ we'd be ok with {{allBuckets}}
skg being supported for sweep collection but not for non-sweep, we could just
output it if it's there (for sweep), rather than jump through hoops to disable
arguably-meaningful output in the service of keeping responses identical
between sweep and non-sweep collection?
> Improve JSON "terms" facet performance when sorted by relatedness
> ------------------------------------------------------------------
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Affects Versions: 7.4, master (9.0)
> Reporter: Michael Gibney
> Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch,
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate
> {{relatedness}} for every term.
> The current implementation uses a standard uninverted approach (either
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain
> base docSet, and then uses that initial pass as a pre-filter for a
> second-pass, inverted approach of fetching docSets for each relevant term
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and
> set intersection operations increases request latency to the point where
> relatedness sort may not be usable in practice (for my use case, even after
> applying the patch for SOLR-13108, for a field with ~220k unique terms per
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable
> ~300ms and ~250ms respectively. The approach calculates uninverted facet
> counts over domain base, foreground, and background docSets in parallel in a
> single pass. This allows us to take advantage of the efficiencies built into
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids
> the per-term docSet creation and set intersection overhead.
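As a rough illustration of the single-pass idea described in the issue, the sketch below counts term occurrences over three doc sets (base, foreground, background) in one walk over the documents. It is not the patch's code: {{docTermOrds}} is a simplified, hypothetical stand-in for the per-document term ordinals that {{docValues}} or {{UnInvertedField}} would provide, and the doc sets are modeled as plain {{java.util.BitSet}} instances.

```java
import java.util.BitSet;

public class SweepCountSketch {

    /**
     * Returns counts[s][ord]: how many docs in set s contain term ord,
     * where s = 0 is base, 1 is foreground, 2 is background.
     * docTermOrds[doc] holds the term ordinals present in that doc.
     */
    static int[][] sweepCounts(int[][] docTermOrds, int numTerms,
                               BitSet base, BitSet foreground, BitSet background) {
        BitSet[] sets = {base, foreground, background};
        int[][] counts = new int[sets.length][numTerms];
        // One pass over the documents; no per-term docSet is ever built.
        for (int doc = 0; doc < docTermOrds.length; doc++) {
            for (int s = 0; s < sets.length; s++) {
                if (!sets[s].get(doc)) {
                    continue; // doc is not a member of this set
                }
                for (int ord : docTermOrds[doc]) {
                    counts[s][ord]++;
                }
            }
        }
        return counts;
    }
}
```

The foreground and background counts per term are the inputs a relatedness calculation needs, so in this sketch the work is proportional to the documents and term occurrences visited rather than to one set intersection per term.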