[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Michael Gibney (Jira) Tue, 05 May 2020 20:06:43 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100425#comment-17100425
 ]


Michael Gibney commented on SOLR-13132:
---------------------------------------

Ouch, yeah. I ran into that too and was trying to fix it earlier today. I 
hadn't yet realized that the discrepancy was related to UIF vs. DV ... with 
that clue it should be solved with 74e7ce3f0e65cc8abebcfcf4deef9da1ee73a656 
(just pushed). For sweeping, the doc iterator includes "hits" for docs that 
aren't in the base DocSet; but collection should only happen for docs in the 
base DocSet, for purpose of stats, etc. This was in the initial patch, then for 
some reason I thought it was only relevant to the caching aspect of the initial 
patch (SOLR-13807) and excised it from this patch. Subsequently realized it was 
still relevant here and reintroduced it with 
8a06d4c5d691dbf3e224223d7a3a0c84dcbfb87b, but mistakenly only for ArrayDV. 
Thanks for catching this!

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTime for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids 
> the per-term docSet creation and set intersection overhead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Reply via email to