[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (Jira) Thu, 23 Apr 2020 13:50:11 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17090931#comment-17090931
 ]


Michael Gibney commented on SOLR-13132:
---------------------------------------

Thanks, Hoss! Some initial responses re: some of the nocommit comments from 
[8fcd6271b6|https://github.com/magibney/lucene-solr/commit/8fcd6271b684da589ddae8e4319b564249ee76cb]:

{code:java}
    // nocommit: for that matter: can we eliminate SweepingAcc as a class,
    // nocommit: and just roll that specific logic into CountSlotAcc?
    // nocommit: IIUC: there should only ever be a single SweepingAcc instance,
    // nocommit: and callers should never use/instantiate a SweepingAcc w/o 
using 'countAcc' ... correct?
{code}
That would definitely work. However, my initial inclination is to prefer 
leaving {{SweepingAcc}} as a separate class, because {{CountSlotAcc}} currently 
clearly does one specific thing, and folding the {{SweepingAcc}} functionality 
(which could be relatively complex -- potentially deduping {{DocSets}}, etc...) 
would mix in a different type of functionality that's only relevant in some of 
the contexts where {{CountSlotAcc}} is currently used. Accessing 
{{SweepingAcc}} via {{countAcc.getBaseSweepingAcc()}} strikes a balance of 
using {{countAcc}} as a coordination point for related but distinct 
functionality ... perhaps a cleaner separation of concerns?

{code:java}
// nocommit: since 'countAcc' is now the special place all sweeping is tracked, 
it seems
// nocommit: unneccessary (and uneccessarly confusing) for it to also be a 
'SweepableSlotAcc'
// nocommit: any reason not to just remove this?
abstract class CountSlotAcc extends SlotAcc implements ReadOnlyCountSlotAcc /*, 
SweepableSlotAcc<CountSlotAcc> ... nocommit... */ {
...
  // nocommit: CountSlotAcc no longer implements SweepableSlotAcc...
  // @Override
  // public CountSlotAcc registerSweepingAccs(SweepingAcc baseSweepingAcc) {
  //   baseSweepingAcc.add(new SweepCountAccStruct(fcontext.base, false, this, 
this));
  //   baseSweepingAcc.registerMapping(this, this);
  //   return null;
  // }
{code}
True, I'm glad you mentioned this. I left this in partly to illustrate another 
concrete case (aside from SKG) for which sweep collection might be useful. In 
its current state it admittedly seems a bit contrived, but my thinking was: 
although {{countAcc}} is currently the one and only {{CountSlotAcc}}, used to 
accumulate counts over the base domain {{DocSet}} only, there could be cases 
where extra {{CountSlotAccs}} are used more directly (e.g. as part of stats 
collection, analogous to how they're used indirectly for SKG sweep collection). 
In such a case, these "non-base" {{CountSlotAccs}} would respond as implemented 
in the above {{registerSweepingAccs(...)}} method. More practically speaking, 
it also occurred to me that one promising use of sweep collection would be to 
accumulate counts over all subfacet domains in a single sweep (for 
nested/sub-facets, pivot facets, what-have-you) -- not sure if that would be 
directly accommodated by the current incarnation of the "sweeping", but it 
might be a use case to consider. With all that said, I'm not at all opposed to 
removing the {{SweepableSlotAcc}} interface from {{CountSlotAcc}}; it should 
anyway be straightforward to add back in later should the need arise.

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to