[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Mon, 22 Jun 2020 18:53:24 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17142526#comment-17142526
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

michael: I'm really sorry, i've been swamped with some other stuff and have not 
yet had a chance to properly & fully catch up on all of your changes since my 
last comment – but i did want to pop in and say:
 * from what i can tell at a quick skim, i really like the direction you took 
with removing hte shim .. seems much simpler and more straight forward to 
understand
 * i'm a little concerned that as we've made various incremental changes,  
{{registerSweepingAccIfSupportedByCollectAcc(SweepingCountSlotAcc)}} has 
evolved into a multi-headed beast...
 ** the original point of this method was to refactor & dedpulicate some 
redundent code across multiple processors...
 *** the primary purpose was to give subclasses a helper method to call to 
ensure {{collectAcc.registerSweepingAccs(sweepingCountAcc)}} got called if: a) 
sweepingCountAcc was a thing, b) collectAcc supported sweeping
 *** add bonus: it eliminated the need for duplicate instanceof/casting checks 
that were in multiple places
 ** as the code & method signature have evolved, this is now called:
 *** in various places that still have to do their own instanceof/casting 
checks of SweepingCountSlotAcc when they call it
 *** very late in the "lifecycle" of the processors, in way that makes it 
un-obvious to me if it might be getting called multiple times for the same 
processor? (which may not be a probem now even if they do, but might cause 
weird bugs down the road if/when more things support sweeping)
 ** i think we should:
 *** eliminate the arg from this API, and make it operate directly on 
{{this.countAcc}} (if and only if it's an instanceof SweepingCountSlotAcc, 
otherwise this method should be a No-Op ... this is how it worked originally 
IIRC)
 *** move _where_ this method is called so that it is called as early, and 
infrequently, as possible ... is there any reason we can't make it the last 
line of {{FacetFieldProcessorByArray.createCollectAcc()}} ? (the only other 
place countAcc/collectAcc might be modified are the refinement path where we 
*know* explicitly we're not sweepin)
 *** add some additional new helper methods (to 
{{FacetFieldProcessor(ByArray?)}} so we don't have to have these blocks of code 
redundently in 3 places (where {{registerSweepingAccIfSupportedByCollectAcc}} 
is currently being called) ...
{code:java}
/** returns the empty list unless this processor is not using sweeping */
public List<SweepCountAccStruct> getOtherSweepStructs() {
  if (countAcc instanceof SweepingCountSlotAcc) {
    return ((SweepingCountSlotAcc) countAcc).others;
  }
  return Collections.emptyList();
}
/** returns the base sweep structure to use, or a dummy struct if sweeping is 
not used */
public SweepCountAccStruct getBaseSweepStruct() {
  if (countAcc instanceof SweepingCountSlotAcc) {
    return ((SweepingCountSlotAcc) countAcc).base;
  }
  return SweepCountAccStruct(fcontext.base, true, countAcc);
}
{code}

WDYT?

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to