[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Wed, 29 Apr 2020 15:26:17 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095926#comment-17095926
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

{quote}Thanks, Hoss! Some initial responses re: some of the nocommit comments 
from 8fcd6271b6: ...
{quote}
yeah, sorry for any folks following along on the jira – i thought i posted a 
comment about this last week:
 * in an offline email with Michael about setting up a branch to more closely 
iterate on, he pointed out that the way github PRs are setup I already had 
permission to push commits to hte branch he was using for his PR
 * so i started committing some test work and small edits with 
refactorings/comments/quesions directly to his branch

{quote}... my initial inclination is to prefer leaving SweepingAcc as a 
separate class, because CountSlotAcc currently clearly does one specific thing, 
...
{quote}
Well, except that with the changes we're making, CountSlotAcc is no longer just 
doing one specific thing (counting) – it's also managing any sweep collectors 
that exist.

I think we need .... something ... to help clean up the API a bit. Not only are 
the internals of {{CountSlotAcc.getBaseSweepingAcc(...)}} really messy and 
code-smell-ish right now, but as i discovered when i added in "whitebox" 
testing that inspects the facet debug output: since 
{{getBaseSweepingAcc().setValues(...)}} has to be called from 
{{FacetFieldProcessor.fillBucketFromSlot()}}, that means even non-sweep 
processors (like "hashdv" for example) wind up initializing {{SweepingAcc}} 
instances.

more concrete thoughts on possible ways to clean up this API / seperation of 
concerns as part of my next comment below...

Re: {{CountSlotAcc implements SweepableSlotAcc<CountSlotAcc>}}
{quote}... my thinking was: although countAcc is currently the one and only 
CountSlotAcc, used to accumulate counts over the base domain DocSet only, there 
could be cases where extra CountSlotAccs are used more directly (e.g. as part 
of stats collection, analogous to how they're used indirectly for SKG sweep 
collection). In such a case, these "non-base" CountSlotAccs would respond as 
implemented
{quote}
that's totally fair – if you can envision possible usages for "extra" 
{{CountSlotAcc}} instances down the road that act as {{SweepableSlotAcc}} 
instances "hanging off" the "main" {{countAcc}} then I'm fine with that, since 
it doesn't really hurt anything and it's a "clean" implementation of the API: 
there's really no reason why all {{CountSlotAcc}} impls can't {{implements 
SweepableSlotAcc<CountSlotAcc>}}.

But taking a step back on the topic of "API cleanliness", i think the direction 
i'm leaning at the moment is that {{SweepingAcc}} should really just be 
{{SweepingCountSlotAcc extends CountSlotArrAcc}} ... a new concrete 
{{CountSlotAcc}} subclass that would be instantiated by processors that support 
sweeping for use as {{this.countAcc}}, and would inherently & automatically 
handles all the "sweep" related logic currently spread between {{SweepingAcc}} 
and {{CountSlotAcc.getBaseSweepingAcc(...)}} (and might eliminate the need for 
{{FacetFieldProcessorSlotAccMapper implements CollectSlotAccMappingAware}} ... 
? ...unless there's another use for {{CollectSlotAccMappingAware}} i'm 
overlooking?)

This would include making sure that {{countAcc.setValues(...)}} "does the right 
thing" when {{countAcc instanceof SweepingCountSlotAcc}} and there are 
registered "output" SlotAccs forit to handle...

Which brings me back to 2 related earlier comments/questions you had...
{quote}I think it's probably better to have collectAcc read access mediation be 
an "all-or-nothing" thing ... considering for example the MultiAcc case where 
some SlotAccs might be modified while others aren't, that separating them and 
calling setValues(...) on the modified and non-modified groups separately would 
affect the order of output. I think this would only affect the MultiAcc case, 
since all other SlotAccs in collectAcc would be all-or-nothing by nature (i.e., 
they'd either register replacement SlotAcc or not).
{quote}
1) I wouldn't worry about the order of keys – this is already non-deterministic 
on master -- not only can changing the "sort" or "prelim_sort" affect the order 
of the keys currently (due to how the "collectAcc" SlotAcc gets treated 
distinctly from "otherAccs") but even things like changing "method" or "limit" 
can cause the keys to come in diff orders (either because a switching to the 
"enum" method, or because of using MultiAcc when doing a singlePass collection)

(This is something i noticed along time ago and forgot about until it started 
biting me in the ass in the randomized test i committed to your branch – 
independent of wether sweeping is used ... i have not yet decided yet if it 
seems better/easier to make the test not care about the order of the keys, or 
to bite the bullet and enforce a deterministic over in {{FacetProcessor}} in a 
new jira)
{quote}For now, rather than override setValues(...) in CountSlotAcc, I left it 
for supporting FacetFieldProcessors to explicitly call 
countAcc.getBaseSweepingAcc().setValues(...). Currently, countAcc values seem 
to generally be accessed directly via countAcc.getCount(...), rather than via 
their setValues(...) method – I wasn't sure about switching all those access 
patterns over – thoughts?
{quote}
2) when i first read this comment, i was on the fence and leaning towards 
leaving hte code alone – but the more i think about it – particularly in the 
context of good separation of concerns – the more i think cleaning the code up 
to consistently use {{countAcc.setValues(...)}} so we can override it is a 
better approach. if the caller also needs to use {{countAcc.getCount(...)}} to 
decide when to recurse, so be it.

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to