[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (Jira) Fri, 10 Apr 2020 13:18:28 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17080937#comment-17080937
 ]


Michael Gibney commented on SOLR-13132:
---------------------------------------

Yesterday and just now, pushed changes that I think address all the points 
you've brought up.
 # Removed the "XXX temporary!" comment. This related code is still important, 
so the comment was misleading
 # "{{CountSlotAccFactory}} interface seems like dead code" – it's providing a 
hook that will be used e.g. with the introduction of caching. Could indeed be 
removed at this point, but I've left it in for now since there was a lot of 
other more substantial refactoring and I wanted to make sure I didn't 
inadvertently do something that would make this harder to reintroduce later.
 # ReadOnlyCountSlotAcc is now an interface, not an abstract class w/ wrapper 
implementation. Good point :)
 # {{FilterCtStruct}} and {{SweepCountAccStruct}} were indeed essentially 
redundant. Removed the former.
 # Regarding "brittleness" all great suggestions. I basically tried to follow 
what you outlined as the "'cleaner' approach"; for now instead of adding 
{{registerSweepingAccs(...)}} directly to {{SlotAcc}} I initially left it as a 
new interface (processors checking by {{instanceof}}). Honestly I did this b/c 
I missed that you suggested adding the new method directly to {{SlotAcc}}. 
Perfectly happy to dispense with the special interface/instanceof checks and 
put it directly in {{SlotAcc}} of course!
 # Acted on advice re: explicitness of semantics for requesting sweep 
collection. Changed "disable_sweep_collection" (default "false" + confusing 
heuristics) to "sweep_collection", defaulting to "true" and acted upon only if 
supported by all components. (TODO: I'm pretty sure I haven't properly gotten 
the final chosen method into the facet debug output yet).
 # I _think_ I acted on all the advice about the weirdness of returning new 
arrays, placing the "base" collector in the final array position, etc. That did 
make things nicer I think. (related topic: it really is important to keep track 
of which is the base domain, b/c we're sweeping (for purposes of count 
collection) over a union domain, only some of which correspond to the base 
domain, which is the only domain for which subsequent collection may be 
necessary in the event that {{collectAcc!=null}}).

The main other change that followed as a consequence of this work is that I 
think it's necessary (if we may now replace {{collectAcc}} with null when full 
domain collection can be accomplished via sweep count collection only) to 
explicitly separate the read-access to full-domain SlotAccs so that output can 
be handled appropriately. Formerly, collectAcc served for both collection _and_ 
read-access for such SlotAccs, but the 1:1 correspondence that made that work 
is no longer a valid assumption; and can't use the other existing references 
that get read for output ({{accs}}, {{otherAccs}}, {{deferredAggs}}, etc.) b/c 
they're handled quite differently (on a 1-slot, 1-bucket-at-a-time a la carte 
way). I could be missing something here, but in any event that's what explains 
the introduction of the new {{FacetFieldProcessor.fullDomainAccs}} field.

I have not yet incorporated the testing patch that you provided above, but I 
wanted to post what I have now, since I _think_ it's fairly complete (pending 
additional testing!).

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to