[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (Jira) Fri, 17 Apr 2020 12:28:22 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17086013#comment-17086013
 ]


Michael Gibney commented on SOLR-13132:
---------------------------------------

Yes! I just pushed a change that handles "reading" from full-domain 
{{SlotAccs}} (of the type previously only collected via {{collectAcc}}) in a 
way analogous to how modifications to {{collectAcc}} are now done on the 
"write" side (pre-{{collectDocs()}}): coordinated via the {{SweepingAcc}} 
returned by {{countAcc.getBaseSweepingAcc()}}.

To summarize the current state of things:
 * {{countAcc}} is always present and is already treated as a special case: 
every {{FacetFieldProcessor}} uses it to accumulate facet counts. As such, it's 
a natural place to expose the data structures that coordinate communication 
between sweep collection, full-domain ("sweepable") {{SlotAccs}} (which are 
initially placed in {{collectAcc}}), and {{setValues(...)}} on those same 
(logical) full-domain {{SlotAccs}}.
 * {{collectAcc}} has historically served as both the receiver of "collect" 
calls to accumulate across the full domain, and as the source for retrieving 
the values set during the collect phase (via {{setValues(...)}}). Additionally, 
it has served as a kind of boolean signal (null, not-null) indicating whether 
"collect" calls are necessary for accumulation, supporting optimizations by 
selecting different code paths in some processors.
 * Going forward, pre-{{collectDocs()}}, {{FacetFieldProcessors}} will have the 
opportunity to request that their write and read access to {{collectAcc}} be 
modified and mediated (respectively) by a unique {{SweepingAcc}} instance 
retrieved from the {{FacetFieldProcessor}}'s unique {{countAcc}} instance. If 
the {{FacetFieldProcessor}} _doesn't_ call 
{{collectAcc.registerSweepingAccs(...)}}, {{collectAcc}} will continue to 
default to being used exactly as before. It is the responsibility of 
{{FacetFieldProcessors}} that _do_ call 
{{collectAcc.registerSweepingAccs(...)}} to ensure that subsequent access to 
full-domain {{SlotAccs}} is first attempted via 
{{countAcc.getBaseSweepingAcc().setValues(...)}}, giving the base 
{{SweepingAcc}} the opportunity to mediate read access as appropriate. 
{{FacetFieldProcessors}} that _don't_ call 
{{collectAcc.registerSweepingAccs(...)}} then don't need to know (or care) 
anything about sweep collection.
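
Roughly, the read-side contract in the list above could be sketched as follows. 
This is a simplified, self-contained illustration and not the actual Solr 
classes: the stand-in types, the {{register(...)}} method, and a 
{{setValues(...)}} that takes a {{Map}} and returns a boolean are all 
assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Simplified, hypothetical model of the read-mediation contract; not the Solr API. */
public class SweepMediationSketch {

  /** Stand-in for a full-domain SlotAcc: writes its value for a slot into a bucket. */
  interface SlotAcc {
    void setValues(Map<String, Object> bucket, int slot);
  }

  /** Stand-in for SweepingAcc: mediates reads for accs that opted in to sweeping. */
  static class SweepingAcc {
    private SlotAcc registered; // null until a processor opts in (hypothetical field)

    void register(SlotAcc acc) {
      registered = acc;
    }

    /** Returns true if it handled the read, false if the caller should fall back. */
    boolean setValues(Map<String, Object> bucket, int slot) {
      if (registered == null) {
        return false;
      }
      registered.setValues(bucket, slot);
      return true;
    }
  }

  /** Stand-in for countAcc: always present, so it owns the base SweepingAcc. */
  static class CountAcc {
    private final SweepingAcc base = new SweepingAcc();

    SweepingAcc getBaseSweepingAcc() {
      return base;
    }
  }

  /** Read side: try the base SweepingAcc first, then fall back to collectAcc. */
  static Map<String, Object> readBucket(CountAcc countAcc, SlotAcc collectAcc, int slot) {
    Map<String, Object> bucket = new LinkedHashMap<>();
    boolean handled = countAcc.getBaseSweepingAcc().setValues(bucket, slot);
    if (!handled && collectAcc != null) {
      collectAcc.setValues(bucket, slot);
    }
    return bucket;
  }

  public static void main(String[] args) {
    SlotAcc plain = (bucket, slot) -> bucket.put("source", "collectAcc");
    SlotAcc swept = (bucket, slot) -> bucket.put("source", "sweep");

    // Processor that never opts in: collectAcc is used exactly as before.
    CountAcc legacy = new CountAcc();
    System.out.println(readBucket(legacy, plain, 0));

    // Processor that opts in: reads are mediated by the base SweepingAcc.
    CountAcc sweeping = new CountAcc();
    sweeping.getBaseSweepingAcc().register(swept);
    System.out.println(readBucket(sweeping, null, 0));
  }
}
```

The point of the fallback in {{readBucket}} is that a processor which never 
registers anything sees no change in behavior.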

I _think_ it's probably better for {{collectAcc}} read access mediation to be 
an "all-or-nothing" thing. Consider, for example, the {{MultiAcc}} case, where 
some {{SlotAccs}} might be modified while others aren't: separating them and 
calling {{setValues(...)}} on the modified and non-modified groups separately 
would affect the order of output. I think (?) this would only affect the 
{{MultiAcc}} case, since all other {{SlotAccs}} in {{collectAcc}} would be 
all-or-nothing by nature (i.e., they'd either register a replacement 
{{SlotAcc}} or not).
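
To illustrate the ordering concern with a toy example (not Solr code): the 
sketch below uses a {{LinkedHashMap}} as a stand-in for an ordered bucket, and 
the stat names and the "swept" grouping are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy illustration of output-order drift when accs are split into two groups. */
public class MultiAccOrderSketch {

  /** Writes every acc in its original order, as a single all-or-nothing pass would. */
  static List<String> allAtOnce(List<String> accNames) {
    Map<String, Object> bucket = new LinkedHashMap<>();
    for (String name : accNames) {
      bucket.put(name, 0);
    }
    return new ArrayList<>(bucket.keySet());
  }

  /** Writes the swept (modified) group first, then the remaining accs. */
  static List<String> splitGroups(List<String> accNames, List<String> swept) {
    Map<String, Object> bucket = new LinkedHashMap<>();
    for (String name : accNames) {
      if (swept.contains(name)) {
        bucket.put(name, 0);
      }
    }
    for (String name : accNames) {
      if (!swept.contains(name)) {
        bucket.put(name, 0);
      }
    }
    return new ArrayList<>(bucket.keySet());
  }

  public static void main(String[] args) {
    List<String> accs = Arrays.asList("sum", "relatedness", "avg");
    System.out.println(allAtOnce(accs));
    System.out.println(splitGroups(accs, Arrays.asList("relatedness")));
  }
}
```

With the split, {{relatedness}} is emitted ahead of {{sum}}, even though the 
set of keys is unchanged.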

For now, rather than override {{setValues(...)}} in {{CountSlotAcc}}, I left it 
for supporting {{FacetFieldProcessors}} to explicitly call 
{{countAcc.getBaseSweepingAcc().setValues(...)}}. Currently, {{countAcc}} 
values seem to generally be accessed directly via {{countAcc.getCount(...)}} 
rather than via their {{setValues(...)}} method, and I wasn't sure about 
switching all those access patterns over. Thoughts?

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.



