[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Thu, 16 Apr 2020 12:52:15 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17085216#comment-17085216
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

Michael: I'm still working my way through the latest changes, but i'm mainly 
surprised by this new {{fullDomainAccs}} concept (my prior suggestions were 
trying to reduce the number of impacts on FacetFieldProcessor API and it's 
impls that don't care about sweeping, this seems to have just "changed" the 
nature of impacts) ...
{quote}The main other change that followed as a consequence of this work is 
that I think it's necessary (if we may now replace {{collectAcc}} with null 
when full domain collection can be accomplished via sweep count collection 
only) to explicitly separate the read-access to full-domain SlotAccs so that 
output can be handled appropriately. Formerly, collectAcc served for both 
collection _and_ read-access for such SlotAccs, but the 1:1 correspondence that 
made that work is no longer a valid assumption; and can't use the other 
existing references that get read for output ({{accs}}, {{otherAccs}}, 
{{deferredAggs}}, etc.) b/c they're handled quite differently (on a 1-slot, 
1-bucket-at-a-time a la carte way). I could be missing something here, but in 
any event that's what explains the introduction of the new 
{{FacetFieldProcessor.fullDomainAccs}} field.
{quote}
Interesting ... yeah, i guess i hadn't really considered how to get values out 
of the {{SweepingAccs}} once we removed them from the {{collectAcc}}.

It seems weird that the processors now have to keep track of a list of 
{{List<SlotAcc> fullDomainAccs}} to call {{setValues}} on later when the 
{{countAcc}} is already tracking those ~same~ (logical) {{SlotAcc}} instances 
in the form of {{SweepingAcc}} instances - so perhaps it would be 
cleaner/simpler if:
 * {{SweepingAcc}} defined a {{setValues}} method (same method sig as 
{{SlotAcc}}) that by default loops over all the "other" {{SweepingAccs}} it 
wraps (similar to how {{MultiAcc.setValues}} works.
 * {{CountAcc}} overrides {{SlotAcc.setValues()}} to look something like...
{code:java}
@Override public void setValues(SimpleOrderedMap<Object> bucket, int slotNum) 
throws IOException {
  super.setValues(bucket, slotNum);
  baseSweepingAcc.setValues(bucket, slotNum);
}
{code}

...?

I'm not suggesting/requesting that you make this change right now ... i haven't 
thought it through enough to have any confidence if it's better/worse then what 
you've got at the moment ... just trying to talk it through and see WDYT since 
you're the most familiar with this code: do you think that would simplify the 
overall changes/impl and help keep the "sweeping" logic in only the classes 
that care/know about sweeping?

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to