[ https://issues.apache.org/jira/browse/SOLR-13132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17095926#comment-17095926 ]
Chris M. Hostetter commented on SOLR-13132: ------------------------------------------- {quote}Thanks, Hoss! Some initial responses re: some of the nocommit comments from 8fcd6271b6: ... {quote} yeah, sorry for any folks following along on the jira – i thought i posted a comment about this last week: * in an offline email with Michael about setting up a branch to more closely iterate on, he pointed out that the way github PRs are setup I already had permission to push commits to hte branch he was using for his PR * so i started committing some test work and small edits with refactorings/comments/quesions directly to his branch {quote}... my initial inclination is to prefer leaving SweepingAcc as a separate class, because CountSlotAcc currently clearly does one specific thing, ... {quote} Well, except that with the changes we're making, CountSlotAcc is no longer just doing one specific thing (counting) – it's also managing any sweep collectors that exist. I think we need .... something ... to help clean up the API a bit. Not only are the internals of {{CountSlotAcc.getBaseSweepingAcc(...)}} really messy and code-smell-ish right now, but as i discovered when i added in "whitebox" testing that inspects the facet debug output: since {{getBaseSweepingAcc().setValues(...)}} has to be called from {{FacetFieldProcessor.fillBucketFromSlot()}}, that means even non-sweep processors (like "hashdv" for example) wind up initializing {{SweepingAcc}} instances. more concrete thoughts on possible ways to clean up this API / seperation of concerns as part of my next comment below... Re: {{CountSlotAcc implements SweepableSlotAcc<CountSlotAcc>}} {quote}... my thinking was: although countAcc is currently the one and only CountSlotAcc, used to accumulate counts over the base domain DocSet only, there could be cases where extra CountSlotAccs are used more directly (e.g. as part of stats collection, analogous to how they're used indirectly for SKG sweep collection). In such a case, these "non-base" CountSlotAccs would respond as implemented {quote} that's totally fair – if you can envision possible usages for "extra" {{CountSlotAcc}} instances down the road that act as {{SweepableSlotAcc}} instances "hanging off" the "main" {{countAcc}} then I'm fine with that, since it doesn't really hurt anything and it's a "clean" implementation of the API: there's really no reason why all {{CountSlotAcc}} impls can't {{implements SweepableSlotAcc<CountSlotAcc>}}. But taking a step back on the topic of "API cleanliness", i think the direction i'm leaning at the moment is that {{SweepingAcc}} should really just be {{SweepingCountSlotAcc extends CountSlotArrAcc}} ... a new concrete {{CountSlotAcc}} subclass that would be instantiated by processors that support sweeping for use as {{this.countAcc}}, and would inherently & automatically handles all the "sweep" related logic currently spread between {{SweepingAcc}} and {{CountSlotAcc.getBaseSweepingAcc(...)}} (and might eliminate the need for {{FacetFieldProcessorSlotAccMapper implements CollectSlotAccMappingAware}} ... ? ...unless there's another use for {{CollectSlotAccMappingAware}} i'm overlooking?) This would include making sure that {{countAcc.setValues(...)}} "does the right thing" when {{countAcc instanceof SweepingCountSlotAcc}} and there are registered "output" SlotAccs forit to handle... Which brings me back to 2 related earlier comments/questions you had... {quote}I think it's probably better to have collectAcc read access mediation be an "all-or-nothing" thing ... considering for example the MultiAcc case where some SlotAccs might be modified while others aren't, that separating them and calling setValues(...) on the modified and non-modified groups separately would affect the order of output. I think this would only affect the MultiAcc case, since all other SlotAccs in collectAcc would be all-or-nothing by nature (i.e., they'd either register replacement SlotAcc or not). {quote} 1) I wouldn't worry about the order of keys – this is already non-deterministic on master -- not only can changing the "sort" or "prelim_sort" affect the order of the keys currently (due to how the "collectAcc" SlotAcc gets treated distinctly from "otherAccs") but even things like changing "method" or "limit" can cause the keys to come in diff orders (either because a switching to the "enum" method, or because of using MultiAcc when doing a singlePass collection) (This is something i noticed along time ago and forgot about until it started biting me in the ass in the randomized test i committed to your branch – independent of wether sweeping is used ... i have not yet decided yet if it seems better/easier to make the test not care about the order of the keys, or to bite the bullet and enforce a deterministic over in {{FacetProcessor}} in a new jira) {quote}For now, rather than override setValues(...) in CountSlotAcc, I left it for supporting FacetFieldProcessors to explicitly call countAcc.getBaseSweepingAcc().setValues(...). Currently, countAcc values seem to generally be accessed directly via countAcc.getCount(...), rather than via their setValues(...) method – I wasn't sure about switching all those access patterns over – thoughts? {quote} 2) when i first read this comment, i was on the fence and leaning towards leaving hte code alone – but the more i think about it – particularly in the context of good separation of concerns – the more i think cleaning the code up to consistently use {{countAcc.setValues(...)}} so we can override it is a better approach. if the caller also needs to use {{countAcc.getCount(...)}} to decide when to recurse, so be it. > Improve JSON "terms" facet performance when sorted by relatedness > ------------------------------------------------------------------ > > Key: SOLR-13132 > URL: https://issues.apache.org/jira/browse/SOLR-13132 > Project: Solr > Issue Type: Improvement > Components: Facet Module > Affects Versions: 7.4, master (9.0) > Reporter: Michael Gibney > Priority: Major > Attachments: SOLR-13132-with-cache-01.patch, > SOLR-13132-with-cache.patch, SOLR-13132.patch, SOLR-13132_testSweep.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > When sorting buckets by {{relatedness}}, JSON "terms" facet must calculate > {{relatedness}} for every term. > The current implementation uses a standard uninverted approach (either > {{docValues}} or {{UnInvertedField}}) to get facet counts over the domain > base docSet, and then uses that initial pass as a pre-filter for a > second-pass, inverted approach of fetching docSets for each relevant term > (i.e., {{count > minCount}}?) and calculating intersection size of those sets > with the domain base docSet. > Over high-cardinality fields, the overhead of per-term docSet creation and > set intersection operations increases request latency to the point where > relatedness sort may not be usable in practice (for my use case, even after > applying the patch for SOLR-13108, for a field with ~220k unique terms per > core, QTime for high-cardinality domain docSets were, e.g.: cardinality > 1816684=9000ms, cardinality 5032902=18000ms). > The attached patch brings the above example QTimes down to a manageable > ~300ms and ~250ms respectively. The approach calculates uninverted facet > counts over domain base, foreground, and background docSets in parallel in a > single pass. This allows us to take advantage of the efficiencies built into > the standard uninverted {{FacetFieldProcessorByArray[DV|UIF]}}), and avoids > the per-term docSet creation and set intersection overhead. -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org