[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Wed, 24 Jun 2020 17:21:10 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144550#comment-17144550
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

I pushed a few commits to your branch containing some small cleanups/tweaks i 
noticed while reviewing the code, and making the changes i suggested in my last 
comment regarding registerSweepingAccIfSupportedByCollectAcc.

(Please let me know what you think of these and if you have any concerns)

In general I feel pretty good about the branch in its current state, i 
think there are really just 2 outstanding questions:
 * the "what is the allBucketSlotNum when sweeping" problem
 ** i polished up your existing approach to make it a little more robust and 
future proof
 ** this approach has grown on me and I think the trade off of how "hackish" it 
feels is appropriate given the esoteric-ness of the situation and the "cost" of 
revamping various APIs to solve it any differently
 *** if relatedness() was more meaningful in the context of the allBuckets 
bucket it might be a different story
 * filterCaching of relatedness() TermQueries in the "non-sweep" situation
 ** as i mentioned before, i really don't think we should be lumping this 
change in with adding sweeping...
 *** 
https://issues.apache.org/jira/browse/SOLR-13132?focusedCommentId=17105821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17105821
 ** i still believe we should remove these changes from this branch, and 
revisit this as an independent change in SOLR-13108.
 *** particularly as a hedge against the risk that the sweeping changes 
introduce some bug we haven't thought of: people can always work around by 
setting {{sweep_collection: false}} to bypass and get the existing behavior, 
but if we _also_ break the existing behavior via a caching change... ugh.
 *** the fact that you still have concerns about the approach being taken, and 
questions about whether using DocsEnumState here would work (i haven't thought 
about it) just solidifies that opinion – let's not let the sweeping changes get 
held up / bogged down any further by questions of caching in the non-sweeping 
code paths.
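
For anyone following along, here is a rough sketch of what the workaround 
request mentioned above might look like from a client. It only builds the JSON 
body; nothing is sent to Solr. The field name, the fore/back queries, the facet 
label "related_terms" and the helper name are all made up for illustration, and 
placing {{sweep_collection}} as an option on the relatedness() stat is an 
assumption about how the branch exposes it.

```python
import json


def build_relatedness_facet(field, fore_query, back_query, sweep=True, limit=10):
    """Build a JSON Facet request body that sorts terms by relatedness().

    Hypothetical helper for illustration only, not part of any Solr client.
    """
    stat = {"type": "func", "func": "relatedness($fore,$back)"}
    if not sweep:
        # Assumption: the opt-out is spelled as an option on the stat itself.
        stat["sweep_collection"] = False
    return {
        "query": "*:*",
        # $fore and $back are resolved from these request params.
        "params": {"fore": fore_query, "back": back_query},
        "facet": {
            "related_terms": {
                "type": "terms",
                "field": field,
                "limit": limit,
                "sort": "r1 desc",
                "facet": {"r1": stat},
            }
        },
    }


# Made-up field and queries; sweep=False asks for the existing code path.
body = build_relatedness_facet("keywords", "category:solr", "*:*", sweep=False)
print(json.dumps(body, indent=2))
```

Leaving {{sweep}} at its default omits the option entirely, so the request is 
unchanged for anyone who never needs the workaround.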

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.
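
To make the contrast in the description concrete, here is a toy sketch of the 
two counting strategies. It is not the Solr implementation: plain Python sets 
stand in for docSets, a dict of doc id to term set stands in for the uninverted 
field, and every name in it is hypothetical. The per-term version also skips 
the pre-filtering step the description mentions. It only shows that one pass 
over the documents can produce the same base, foreground and background counts 
as building a doc set per term and intersecting it three times.

```python
from collections import Counter


def sweep_counts(doc_terms, base, fore, back):
    """Single pass over documents, counting each term for all three sets at once."""
    counts = {"base": Counter(), "fore": Counter(), "back": Counter()}
    for doc_id, terms in doc_terms.items():
        membership = {
            "base": doc_id in base,
            "fore": doc_id in fore,
            "back": doc_id in back,
        }
        if not any(membership.values()):
            continue  # document is in none of the sets, nothing to count
        for term in terms:
            for name, is_member in membership.items():
                if is_member:
                    counts[name][term] += 1
    return counts


def per_term_counts(doc_terms, base, fore, back):
    """Build a doc set per term, then intersect it with each of the three sets."""
    postings = {}
    for doc_id, terms in doc_terms.items():
        for term in terms:
            postings.setdefault(term, set()).add(doc_id)
    counts = {"base": Counter(), "fore": Counter(), "back": Counter()}
    for term, docs in postings.items():
        for name, docset in (("base", base), ("fore", fore), ("back", back)):
            size = len(docs & docset)
            if size:
                counts[name][term] = size
    return counts


# Toy data: each document carries a set of unique terms.
doc_terms = {
    1: {"solr", "lucene"},
    2: {"solr"},
    3: {"lucene", "java"},
    4: {"java"},
}
base, fore, back = {1, 2, 3}, {1, 2}, {1, 2, 3, 4}

swept = sweep_counts(doc_terms, base, fore, back)
assert swept == per_term_counts(doc_terms, base, fore, back)
print(swept["fore"]["solr"], swept["back"]["java"])  # prints: 2 2
```

The per-term version does work proportional to the number of terms times the 
size of their doc sets, which is the overhead the description attributes to 
high-cardinality fields.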



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
