[jira] [Commented] (SOLR-13132) Improve JSON "terms" facet performance when sorted by relatedness

Chris M. Hostetter (Jira) Tue, 05 May 2020 16:21:43 -0700


    [ 
https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100316#comment-17100316
 ]


Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

[~mgibney] - I haven't had a chance to review 
6338b327d30d0c1d5fdcb8168baf8398b02787d4 in depth, but at first glance I'm a 
fan of what I see.

Your MultiAcc fix in 9ab4baef4a95a90e080d9118608619855f2e2759 seems to have 
addressed both of the failures I mentioned before, but in trying to beef up the 
MultiAcc testing I uncovered a new type of failure. I don't really have any 
ideas (or even guesses) as to what exactly the source of the problem is, but 
what I'm seeing is that in the MultiAcc situation _non-sweeping_ stats don't 
seem to be getting merged properly in some cases (even w/o refinement)

...BUT...

The discrepancy doesn't actually manifest when comparing "sweep vs non-sweep". 
It happens when comparing the "default" behavior for faceting on a multivalued 
string field (which should be using ArrayUIF for these fields, and implicitly 
sweeping) against explicitly using ArrayDV w/sweeping. The buckets, 
counts, "skg" results, and bucket orderings are all consistent regardless of 
sort, but we get different values for the non-sweeping "min" stat that's used 
in the same facets (or in some cases, the "min" stat is completely missing).

You can see this when testing against the 
ec5f3a451a0be1a07a9ca37bf2cce33b8548b245 I just pushed to your branch with 
something like...
{noformat}
ant test -Dtestcase=TestCloudJSONFacetSKGSweep \
  -Dtests.seed=838E28E3EBC3B3B3:66DB4D3457DDE2F7 -Dtests.slow=true \
  -Dtests.badapples=true -Dtests.locale=rof-TZ -Dtests.timezone=Etc/GMT+3 \
  -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{noformat}
----
Here's how the new testBespokeStructures I just added fails with that seed 
combination...
{noformat}
   [junit4]    > Throwable #1: java.lang.AssertionError: 
rows=0&q=(field_11_multi_sdsS:55+OR+field_0_multi_ss:46)&fore=field_5_multi_sdsS:9&back=*:*&json.facet={xxx1+:+{+type:terms,+method:${method_val:smart},+field:field_0_multi_ss,+limit:-1,+overrequest:0,+sort:+'count+desc',+refine:+false,+facet:{skg+:+{+"type":+"func",+++"func":+"relatedness($fore,$back)",+++"min_popularity":+0.001,+++${sweep_key:xxx}:+${sweep_val:yyy}+}+,min+:+"min(field_3_solo_i)"+}}+}
 ===> Mismatch: .xxx1.buckets[2][min]:32!=40 using method_val=dv&sweep_key=sweep_collection&sweep_val=true
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([838E28E3EBC3B3B3:66DB4D3457DDE2F7]:0)
   [junit4]    >        at org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep.assertFacetSKGsAreConsistent(TestCloudJSONFacetSKGSweep.java:635)
   [junit4]    >        at org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep.testBespokeStructures(TestCloudJSONFacetSKGSweep.java:513)
{noformat}
And here's the first bit of the "expected" (i.e., ArrayUIF, I believe) vs "actual" 
(ArrayDV) output, reformatted for easier reading...
{noformat}
expected = {count=11, xxx1={buckets=[
  {val=39, count=3, 
    skg={relatedness=0.0063, foreground_popularity=0.00833, 
background_popularity=0.08333}, 
    min=10}, 
  {val=46, count=3, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.025}, 
    min=37}, 
  {val=10, count=2, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.04167}, 
    min=32},
...

actual = {count=11, xxx1={buckets=[
  {val=39, count=3, 
    skg={relatedness=0.0063, foreground_popularity=0.00833, 
background_popularity=0.08333}, 
    min=10}, 
  {val=46, count=3, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.025}, 
    min=37}, 
  {val=10, count=2, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.04167}, 
    min=40}, 
...
{noformat}
----
testBespoke w/same seed shows a "null" value for min in the first bucket, 
and other buckets have inconsistent min values ...
{noformat}
   [junit4]    > Throwable #1: java.lang.AssertionError: 
rows=0&q=(field_13_multi_sds:26+OR+field_6_multi_ss:33+OR+field_9_multi_ss:24)&fore=(field_4_multi_sds:27+OR+field_12_multi_ss:18+OR+field_2_multi_sdsS:28+OR+field_13_multi_sds:50)&back=*:*&json.facet={xxx+:+{+type:terms,+method:${method_val:smart},+field:field_12_multi_ss,+limit:-1,+overrequest:0,+sort:+'count+asc',+refine:+false,+facet:{skg+:+{+"type":+"func",+++"func":+"relatedness($fore,$back)",+++"min_popularity":+0.001,+++${sweep_key:xxx}:+${sweep_val:yyy}+}+,min+:+"min(field_4_multi_ids)"+}}+}&_stateVer_=org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep_collection:4
 ===> Mismatch: .xxx.buckets[0][min]==null using method_val=dv&sweep_key=sweep_collection&sweep_val=true

....


expected = {count=23, xxx={buckets=[
  {val=17, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}, 
    min=16}, 
  {val=19, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}, 
    min=5}, 
  {val=22, count=1, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.04167},
    min=5}, 
  {val=23, count=1, 
    skg={relatedness=0.0112, foreground_popularity=0.01667, 
background_popularity=0.04167}, 
    min=16}, 
  {val=25, count=1, 
    skg={relatedness=0.00579, foreground_popularity=0.00833, 
background_popularity=0.025}, 
    min=7}, 
  {val=30, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}, 
    min=9}, 
  {val=31, count=1, 
    skg={relatedness=0.01517, foreground_popularity=0.025, 
background_popularity=0.05833}, 
    min=5},
...

actual = {count=23, xxx={buckets=[
  {val=17, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}
    }, 
  {val=19, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}, 
    min=10}, 
  {val=22, count=1, 
    skg={relatedness=-Infinity, foreground_popularity=0.0, 
background_popularity=0.04167}, 
    min=12},
  {val=23, count=1, 
    skg={relatedness=0.0112, foreground_popularity=0.01667, 
background_popularity=0.04167}, 
    min=42}, 
  {val=25, count=1, 
    skg={relatedness=0.00579, foreground_popularity=0.00833, 
background_popularity=0.025}, 
    min=7}, 
  {val=30, count=1, 
    skg={relatedness=-0.00387, foreground_popularity=0.00833, 
background_popularity=0.05833}, 
    min=30}, 
  {val=31, count=1, 
    skg={relatedness=0.01517, foreground_popularity=0.025, 
background_popularity=0.05833}, 
    min=7}, 
...
{noformat}
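
To make the two symptoms above (a differing "min" vs a missing "min") easier to eyeball, here is a minimal, hypothetical sketch of a per-bucket stat diff. It is not the test's actual assertFacetSKGsAreConsistent logic; the class name BucketStatDiff and its helpers are invented for illustration, and the sample values are copied from the testBespokeStructures output above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

/** Hypothetical helper (not part of Solr): reports per-bucket differences for one stat. */
public class BucketStatDiff {

  /** Builds one bucket as an ordered map; passing null for min leaves the stat absent. */
  public static Map<String, Object> bucket(Object val, long count, Object min) {
    Map<String, Object> b = new LinkedHashMap<>();
    b.put("val", val);
    b.put("count", count);
    if (min != null) {
      b.put("min", min);
    }
    return b;
  }

  /** Compares the named stat bucket-by-bucket; an absent stat is reported as null. */
  public static List<String> diffStat(String facetName, String stat,
      List<Map<String, Object>> expected, List<Map<String, Object>> actual) {
    List<String> mismatches = new ArrayList<>();
    int n = Math.min(expected.size(), actual.size());
    for (int i = 0; i < n; i++) {
      Object e = expected.get(i).get(stat);
      Object a = actual.get(i).get(stat);
      if (!Objects.equals(e, a)) {
        mismatches.add("." + facetName + ".buckets[" + i + "][" + stat + "]: " + e + " != " + a);
      }
    }
    if (expected.size() != actual.size()) {
      mismatches.add("." + facetName + ".buckets: size " + expected.size() + " != " + actual.size());
    }
    return mismatches;
  }

  public static void main(String[] args) {
    // First three buckets of the testBespokeStructures output above.
    List<Map<String, Object>> expected =
        Arrays.asList(bucket(39, 3, 10), bucket(46, 3, 37), bucket(10, 2, 32));
    List<Map<String, Object>> actual =
        Arrays.asList(bucket(39, 3, 10), bucket(46, 3, 37), bucket(10, 2, 40));
    for (String m : diffStat("xxx1", "min", expected, actual)) {
      System.out.println(m);
    }
  }
}
```

Because an absent stat is treated as null, the same diff flags both the "different value" case and the "min is completely missing" case.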
 

> Improve JSON "terms" facet performance when sorted by relatedness 
> ------------------------------------------------------------------
>
>                 Key: SOLR-13132
>                 URL: https://issues.apache.org/jira/browse/SOLR-13132
>             Project: Solr
>          Issue Type: Improvement
>          Components: Facet Module
>    Affects Versions: 7.4, master (9.0)
>            Reporter: Michael Gibney
>            Priority: Major
>         Attachments: SOLR-13132-with-cache-01.patch, 
> SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
>
>          Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate 
> {{relatedness}} for every term. 
> The current implementation uses a standard uninverted approach (either 
> {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain 
> base docSet, and then uses that initial pass as a pre-filter for a 
> second-pass, inverted approach of fetching docSets for each relevant term 
> (i.e., {{count > minCount}}?) and calculating intersection size of those sets 
> with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and 
> set intersection operations increases request latency to the point where 
> relatedness sort may not be usable in practice (for my use case, even after 
> applying the patch for SOLR-13108, for a field with ~220k unique terms per 
> core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 
> 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable 
> ~300ms and ~250ms respectively. The approach calculates uninverted facet 
> counts over domain base, foreground, and background docSets in parallel in a 
> single pass. This allows us to take advantage of the efficiencies built into 
> the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids 
> the per-term docSet creation and set intersection overhead.
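
For illustration only, the single-pass idea in the description above can be sketched roughly as follows. This is not the patch's code: the class SweepCountSketch, the in-memory int[][] of per-document term ordinals, and the use of java.util.BitSet for docSets are all simplifying assumptions standing in for Solr's docValues/UnInvertedField structures and DocSet implementations.

```java
import java.util.BitSet;

/** Hypothetical sketch: per-term counts over three doc sets in a single pass. */
public class SweepCountSketch {

  /** Convenience helper to build a doc set from doc ids. */
  public static BitSet bits(int... docs) {
    BitSet b = new BitSet();
    for (int d : docs) {
      b.set(d);
    }
    return b;
  }

  /**
   * Returns counts[0] = base, counts[1] = foreground, counts[2] = background,
   * each indexed by term ordinal. docTermOrds[doc] lists the term ordinals of
   * a (multivalued) field for that document.
   */
  public static int[][] sweepCounts(int[][] docTermOrds, int numTerms,
      BitSet base, BitSet fore, BitSet back) {
    int[][] counts = new int[3][numTerms];
    for (int doc = 0; doc < docTermOrds.length; doc++) {
      boolean inBase = base.get(doc);
      boolean inFore = fore.get(doc);
      boolean inBack = back.get(doc);
      if (!inBase && !inFore && !inBack) {
        continue; // doc contributes to none of the three sets
      }
      // Each doc's terms are visited once, updating all relevant accumulators.
      for (int ord : docTermOrds[doc]) {
        if (inBase) counts[0][ord]++;
        if (inFore) counts[1][ord]++;
        if (inBack) counts[2][ord]++;
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    int[][] docTermOrds = { {0, 1}, {1}, {0}, {2} };
    int[][] c = sweepCounts(docTermOrds, 3, bits(0, 1), bits(1, 2), bits(0, 1, 2, 3));
    for (int ord = 0; ord < 3; ord++) {
      System.out.println("term " + ord + ": base=" + c[0][ord]
          + " fore=" + c[1][ord] + " back=" + c[2][ord]);
    }
  }
}
```

The point of the sketch is only the shape of the computation: no per-term docSet is built and no per-term set intersection is performed, since the foreground and background counts fall out of the same traversal that produces the base counts.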



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
