[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17100316#comment-17100316 ]
Chris M. Hostetter commented on SOLR-13132:
-------------------------------------------

[~mgibney] - i haven't had a chance to review 6338b327d30d0c1d5fdcb8168baf8398b02787d4 in depth, but at first glance I'm a fan of what i see.

your MultiAcc fix in 9ab4baef4a95a90e080d9118608619855f2e2759 seems to have addressed both of the failures I mentioned before, but in trying to beef up the MultiAcc testing i uncovered a new type of failure...

i don't really have any ideas (or even guesses) as to what exactly is the source of the problem, but what i'm seeing is that in the MultiAcc situation _non-sweeping_ stats don't seem to be getting merged properly in some cases (even w/o refinement) ...BUT... the discrepancy doesn't actually manifest when comparing "sweep vs non-sweep" – it happens when comparing the "default" behavior for faceting on a multivalued string field (which should be using ArrayUIF for these fields, and implicitly sweeping) compared to explicitly using ArrayDV w/sweeping ... the buckets, counts, "skg" results, and bucket orderings are all consistent regardless of sort, but we get different values for the non-sweeping "min" stat that's used in the same facets (or in some cases, the "min" stat is completely missing)

You can see this when testing against the ec5f3a451a0be1a07a9ca37bf2cce33b8548b245 i just pushed to your branch with something like...

{noformat}
ant test -Dtestcase=TestCloudJSONFacetSKGSweep -Dtests.seed=838E28E3EBC3B3B3:66DB4D3457DDE2F7 -Dtests.slow=true -Dtests.badapples=true -Dtests.locale=rof-TZ -Dtests.timezone=Etc/GMT+3 -Dtests.asserts=true -Dtests.file.encoding=UTF-8
{noformat}

----

Here's how the new testBespokeStructures i just added fails with that seed combination...
{noformat}
[junit4] > Throwable #1: java.lang.AssertionError: rows=0&q=(field_11_multi_sdsS:55+OR+field_0_multi_ss:46)&fore=field_5_multi_sdsS:9&back=*:*&json.facet={xxx1+:+{+type:terms,+method:${method_val:smart},+field:field_0_multi_ss,+limit:-1,+overrequest:0,+sort:+'count+desc',+refine:+false,+facet:{skg+:+{+"type":+"func",+++"func":+"relatedness($fore,$back)",+++"min_popularity":+0.001,+++${sweep_key:xxx}:+${sweep_val:yyy}+}+,min+:+"min(field_3_solo_i)"+}}+} ===> Mismatch: .xxx1.buckets[2][min]:32!=40 using method_val=dv&sweep_key=sweep_collection&sweep_val=true
[junit4] > at __randomizedtesting.SeedInfo.seed([838E28E3EBC3B3B3:66DB4D3457DDE2F7]:0)
[junit4] > at org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep.assertFacetSKGsAreConsistent(TestCloudJSONFacetSKGSweep.java:635)
[junit4] > at org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep.testBespokeStructures(TestCloudJSONFacetSKGSweep.java:513)
{noformat}

And here's the first bit of the "expected" (ie: ArrayUIF i believe) vs "actual" (ArrayDV) output, reformatted for easier reading...

{noformat}
expected = {count=11, xxx1={buckets=[
  {val=39, count=3, skg={relatedness=0.0063, foreground_popularity=0.00833, background_popularity=0.08333}, min=10},
  {val=46, count=3, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.025}, min=37},
  {val=10, count=2, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.04167}, min=32},
  ...
actual = {count=11, xxx1={buckets=[
  {val=39, count=3, skg={relatedness=0.0063, foreground_popularity=0.00833, background_popularity=0.08333}, min=10},
  {val=46, count=3, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.025}, min=37},
  {val=10, count=2, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.04167}, min=40},
  ...
{noformat}

----

testBespoke w/same seed shows a "null" value for min for the first bucket - but other buckets have inconsistent min values ...
{noformat}
[junit4] > Throwable #1: java.lang.AssertionError: rows=0&q=(field_13_multi_sds:26+OR+field_6_multi_ss:33+OR+field_9_multi_ss:24)&fore=(field_4_multi_sds:27+OR+field_12_multi_ss:18+OR+field_2_multi_sdsS:28+OR+field_13_multi_sds:50)&back=*:*&json.facet={xxx+:+{+type:terms,+method:${method_val:smart},+field:field_12_multi_ss,+limit:-1,+overrequest:0,+sort:+'count+asc',+refine:+false,+facet:{skg+:+{+"type":+"func",+++"func":+"relatedness($fore,$back)",+++"min_popularity":+0.001,+++${sweep_key:xxx}:+${sweep_val:yyy}+}+,min+:+"min(field_4_multi_ids)"+}}+}&_stateVer_=org.apache.solr.search.facet.TestCloudJSONFacetSKGSweep_collection:4 ===> Mismatch: .xxx.buckets[0][min]==null using method_val=dv&sweep_key=sweep_collection&sweep_val=true
....
expected = {count=23, xxx={buckets=[
  {val=17, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833}, min=16},
  {val=19, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833}, min=5},
  {val=22, count=1, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.04167}, min=5},
  {val=23, count=1, skg={relatedness=0.0112, foreground_popularity=0.01667, background_popularity=0.04167}, min=16},
  {val=25, count=1, skg={relatedness=0.00579, foreground_popularity=0.00833, background_popularity=0.025}, min=7},
  {val=30, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833}, min=9},
  {val=31, count=1, skg={relatedness=0.01517, foreground_popularity=0.025, background_popularity=0.05833}, min=5},
  ...
actual = {count=23, xxx={buckets=[
  {val=17, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833} },
  {val=19, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833}, min=10},
  {val=22, count=1, skg={relatedness=-Infinity, foreground_popularity=0.0, background_popularity=0.04167}, min=12},
  {val=23, count=1, skg={relatedness=0.0112, foreground_popularity=0.01667, background_popularity=0.04167}, min=42},
  {val=25, count=1, skg={relatedness=0.00579, foreground_popularity=0.00833, background_popularity=0.025}, min=7},
  {val=30, count=1, skg={relatedness=-0.00387, foreground_popularity=0.00833, background_popularity=0.05833}, min=30},
  {val=31, count=1, skg={relatedness=0.01517, foreground_popularity=0.025, background_popularity=0.05833}, min=7},
  ...
{noformat}

> Improve JSON "terms" facet performance when sorted by relatedness
> ------------------------------------------------------------------
>
> Key: SOLR-13132
> URL: https://issues.apache.org/jira/browse/SOLR-13132
> Project: Solr
> Issue Type: Improvement
> Components: Facet Module
> Affects Versions: 7.4, master (9.0)
> Reporter: Michael Gibney
> Priority: Major
> Attachments: SOLR-13132-with-cache-01.patch, SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch
> Time Spent: 1.5h
> Remaining Estimate: 0h
>
> When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate {{relatedness}} for every term.
> The current implementation uses a standard uninverted approach (either {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain base docSet, and then uses that initial pass as a pre-filter for a second-pass, inverted approach of fetching docSets for each relevant term (i.e., {{count > minCount}}?) and calculating intersection size of those sets with the domain base docSet.
> Over high-cardinality fields, the overhead of per-term docSet creation and set intersection operations increases request latency to the point where relatedness sort may not be usable in practice (for my use case, even after applying the patch for SOLR-13108, for a field with ~220k unique terms per core, QTimes for high-cardinality domain docSets were, e.g.: cardinality 1816684=9000ms, cardinality 5032902=18000ms).
> The attached patch brings the above example QTimes down to a manageable ~300ms and ~250ms respectively. The approach calculates uninverted facet counts over domain base, foreground, and background docSets in parallel in a single pass. This allows us to take advantage of the efficiencies built into the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}, and avoids the per-term docSet creation and set intersection overhead.

--
This message was sent by Atlassian Jira (v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
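
For readers following the quoted issue description, here is a rough sketch of the difference between the per-term "inverted" intersection and the single-pass "sweep" count over base, foreground, and background docSets. It is not Solr code: the class {{SweepCountSketch}}, its methods, and the {{int[][] docTerms}} stand-in for an uninverted/docValues view are all hypothetical names invented for illustration, and {{java.util.BitSet}} stands in for a docSet.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Hypothetical illustration only; none of these names exist in Solr. */
public class SweepCountSketch {

    /** Builds a BitSet "docSet" from a list of doc ids. */
    public static BitSet docSet(int... docs) {
        BitSet set = new BitSet();
        for (int doc : docs) {
            set.set(doc);
        }
        return set;
    }

    /**
     * Single pass over an uninverted view (docTerms[doc] = term ordinals of that doc).
     * Returns counts[0] = base, counts[1] = foreground, counts[2] = background,
     * each indexed by term ordinal.
     */
    public static int[][] sweep(int[][] docTerms, int numTerms,
                                BitSet base, BitSet fore, BitSet back) {
        int[][] counts = new int[3][numTerms];
        for (int doc = 0; doc < docTerms.length; doc++) {
            boolean inBase = base.get(doc);
            boolean inFore = fore.get(doc);
            boolean inBack = back.get(doc);
            if (!inBase && !inFore && !inBack) {
                continue; // doc contributes to none of the three sets
            }
            for (int ord : docTerms[doc]) {
                if (inBase) counts[0][ord]++;
                if (inFore) counts[1][ord]++;
                if (inBack) counts[2][ord]++;
            }
        }
        return counts;
    }

    /** Inverted-style alternative: intersect one term's docSet with another set. */
    public static int intersectionCount(BitSet termDocs, BitSet other) {
        BitSet copy = (BitSet) termDocs.clone();
        copy.and(other);
        return copy.cardinality();
    }

    public static void main(String[] args) {
        // Four docs, three terms (ordinals 0..2); purely made-up data.
        int[][] docTerms = {{0, 1}, {1}, {0, 2}, {2}};
        BitSet base = docSet(0, 1, 2);
        BitSet fore = docSet(0);
        BitSet back = docSet(0, 1, 2, 3);

        int[][] counts = sweep(docTerms, 3, base, fore, back);
        System.out.println("base: " + Arrays.toString(counts[0]));
        System.out.println("fore: " + Arrays.toString(counts[1]));
        System.out.println("back: " + Arrays.toString(counts[2]));

        // The per-term route needs one docSet and one intersection per term and per set.
        BitSet term2Docs = docSet(2, 3);
        System.out.println("term 2 in base (inverted): " + intersectionCount(term2Docs, base));
    }
}
```

The sweep visits each document once and updates all three count arrays together, whereas the inverted route repeats a docSet fetch and intersection for every term, which is the overhead the description says grows with field cardinality. How the real patch accumulates counts inside {{FacetFieldProcessorByArray[DV|UIF]}} is not shown here.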