[ 
https://issues.apache.org/jira/browse/SOLR-15079?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris M. Hostetter updated SOLR-15079:
--------------------------------------
    Attachment: SOLR-15079.patch
        Status: Open  (was: Open)

The main difference between this new approach and the existing collapse 
approach is that the existing collapse PostFilter maintains a big in-memory 
data structure of every "group key" (values from the collapse field) it sees 
in the matching docs, and the "best" matching doc of each group (ie: the 
current "group head"), along with the selector values corresponding to each of 
those group head docs that are needed to determine if they are better/worse 
than any other candidate doc for that group that might come along (this might 
be the 'score' of each doc w/default collapsing, or some field values if one 
of the min/max/sort group head selectors is used). Once the PostFilter is done 
collecting all matching docs, it does another pass over these data structures 
to delegate collection of just the (final) best "group heads".
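
As a rough illustration of that existing approach (not the actual CollapsingQParserPlugin code), the sketch below keeps one entry per group key in a map, then makes a second pass to emit the group heads in doc id order. The {{Doc}} class, the method names, and the use of 'score' as the only selector are assumptions made for the example; docs with a null key are simply skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class MapCollapseSketch {
    /** Hypothetical stand-in for a matching doc: id, collapse field value, score. */
    static final class Doc {
        final int id;
        final String groupKey;
        final float score;
        Doc(int id, String groupKey, float score) {
            this.id = id;
            this.groupKey = groupKey;
            this.score = score;
        }
    }

    /** Tracks the best doc of EVERY group at once, then emits the heads afterwards. */
    static List<Integer> collapse(List<Doc> matches) {
        // Pass 1: one map entry per group key seen; memory grows with the number of groups.
        Map<String, Doc> heads = new HashMap<>();
        for (Doc doc : matches) {
            if (doc.groupKey == null) continue; // assumption: null keys are skipped
            Doc head = heads.get(doc.groupKey);
            if (head == null || doc.score > head.score) {
                heads.put(doc.groupKey, doc);
            }
        }
        // Pass 2: gather the final group heads and sort them by doc id.
        List<Integer> ids = new ArrayList<>();
        for (Doc head : heads.values()) {
            ids.add(head.id);
        }
        Collections.sort(ids);
        // Pass 3 (not shown): delegate collection of each id, in order.
        return ids;
    }

    public static void main(String[] args) {
        List<Doc> matches = Arrays.asList(
            new Doc(1, "A", 0.5f), new Doc(2, "A", 0.9f),
            new Doc(3, "B", 0.3f),
            new Doc(4, "C", 0.7f), new Doc(5, "C", 0.2f));
        System.out.println(collapse(matches)); // prints [2, 3, 4]
    }
}
```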

In the new logic, since we know our grouping field is unique per "block" of 
indexed documents, no large in-memory data structures are needed to track 
_all_ groups at once – we can simply record the single best doc / group head 
selector values for the _current_ group, and once we encounter a doc with a new 
value in the collapse field (ie: a new "group key"), we can immediately 
delegate collection of the "previous" group's best matching doc, and throw away 
its metadata.
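
A minimal sketch of this one-pass idea, assuming matching docs arrive in index order and that docs sharing a group key are contiguous. This is not the code from the patch: the {{Doc}} class and method names are hypothetical, 'score' stands in for whatever group head selector is in use, and docs with a null key are skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class BlockCollapseSketch {
    /** Hypothetical stand-in for a matching doc: id, collapse field value, score. */
    static final class Doc {
        final int id;
        final String groupKey;
        final float score;
        Doc(int id, String groupKey, float score) {
            this.id = id;
            this.groupKey = groupKey;
            this.score = score;
        }
    }

    /** Single pass; only the current group's head is held in memory. */
    static List<Integer> collapse(List<Doc> matchesInIndexOrder) {
        List<Integer> collected = new ArrayList<>(); // stands in for the delegate collector
        String currentKey = null;
        Doc currentHead = null;
        for (Doc doc : matchesInIndexOrder) {
            if (doc.groupKey == null) continue; // assumption: null keys are skipped
            if (currentHead != null && !doc.groupKey.equals(currentKey)) {
                // New group key: the previous group is complete, so emit its
                // head right away and discard its metadata.
                collected.add(currentHead.id);
                currentHead = null;
            }
            if (currentHead == null || doc.score > currentHead.score) {
                currentKey = doc.groupKey;
                currentHead = doc;
            }
        }
        if (currentHead != null) {
            collected.add(currentHead.id); // flush the last group
        }
        return collected;
    }

    public static void main(String[] args) {
        List<Doc> matches = Arrays.asList(
            new Doc(1, "A", 0.5f), new Doc(2, "A", 0.9f),
            new Doc(3, "B", 0.3f),
            new Doc(4, "C", 0.7f), new Doc(5, "C", 0.2f));
        System.out.println(collapse(matches)); // prints [2, 3, 4]
    }
}
```

Because each head is emitted as soon as its group ends, the output is already in doc id order and no separate sort of group heads is needed.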

This means the new impl uses a *LOT* less RAM than the old impl.
----
I did some benchmarking using an index built from some ecommerce style data 
containing ~50,000 (Parent) Products, ~8.5 Million (Child) SKUs in collections 
that had 6 shards, 1 replica each, with each replica hosted on its own Solr 
node. Test clients issued randomized queries designed to match different 
permutations of docs, w/varying number of matches per group.
 * Long running query tests against the collection built using nested docs and 
using block collapse had (cumulative) query times ~45% to 65% lower than a 
"typical" collection
 ** the relative perf gains of the new impl were higher as the query load (ie: 
num concurrent clients) increased
 ** the relative perf gains were consistent regardless of how many docs matched 
the test query, how many unique groups those docs were in, or how many docs in 
those groups were matched by those queries
 ** there was some notable diff in relative perf based on the number of 
segments – but that was because the existing impl does significantly better 
when there are fewer segments (probably due to ordinal mapping?) while the new 
impl has largely consistent behavior regardless of the number of segments
 * A lot of the "overall gains" probably come from reduced GC/memory contention 
(which system monitoring demonstrated was notably reduced with the new impl), 
but even in micro load testing the new implementation is faster on individual 
requests – which makes sense because it only has to do a single pass over the 
matching documents (as opposed to the "one pass over matching docs + one pass 
over matching groups to sort the group head doc ids + one pass over the final 
docids" of the existing impl)
 ** so the more unique groups matched by a query, the faster the new impl is 
(relatively speaking) compared to the existing impl

----
The attached patch includes this new logic/approach and uses it by default when 
the collapse field is {{_root_}}, but it also supports a new {{hint=block}} 
option users can specify if they want this logic for other fields when they 
know their groups are co-located. This is necessary if you have "deeply nested" 
documents and you want to group on something that isn't consistent for all 
descendants of the same {{_root_}} doc, but is consistent for all descendants 
of particular ancestor docs.

Example: each root (level-0) product doc may have multiple (level-1) SKU 
"child" docs, and each SKU doc may have its own (level-2) "variant" child docs 
(ie: grand children of 'product') that include a "sku_s" field which is 
guaranteed to be consistent in every "variant" doc of the same SKU (and 
guaranteed to be unique across all unique SKU level documents). You could use 
{{"hint=block field=sku_s"}} when searching against variant docs to collapse 
down to the "best" variant for each sku.
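
For illustration only, here is a small helper that assembles such a filter string. The {{collapseFilter}} helper and its class are hypothetical; the {{\{!collapse ...\}}} local-params wrapper is the existing collapse parser syntax, and {{hint=block}} is the option described above.

```java
class CollapseFilterSketch {
    /** Hypothetical helper: builds a collapse filter, optionally with the block hint. */
    static String collapseFilter(String field, boolean blockHint) {
        StringBuilder sb = new StringBuilder("{!collapse field=").append(field);
        if (blockHint) {
            sb.append(" hint=block"); // only valid when groups are co-located in the index
        }
        return sb.append('}').toString();
    }

    public static void main(String[] args) {
        // prints fq={!collapse field=sku_s hint=block}
        System.out.println("fq=" + collapseFilter("sku_s", true));
    }
}
```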

NOTE: This approach is only valid for {{nullPolicy=expand}} or 
{{nullPolicy=ignore}} (the default). It would not be possible to implement 
{{nullPolicy=collapse}} with this type of "one pass" approach.

I feel like the current patch is really solid and ready to commit & backport to 
8x, but I welcome any questions/concerns.

> Block Collapse (faster collapse code when groups are co-located via Block 
> Join style nested doc indexing)
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SOLR-15079
>                 URL: https://issues.apache.org/jira/browse/SOLR-15079
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Chris M. Hostetter
>            Assignee: Chris M. Hostetter
>            Priority: Major
>         Attachments: SOLR-15079.patch
>
>
> A while back, JoelB had an idea for an optimized version of the logic in 
> the CollapsingQParserPlugin to take advantage of collapsing on fields where 
> the user knows that all docs with the same collapseKey are contiguous 
> in the index - for example collapsing on the {{_root_}} field.
> Joel whipped up an initial PoC patch internally at Lucidworks that only dealt 
> with some limited cases (string field collapsing w/o any nulls, using default 
> group head selection) to explain the idea, but other priorities prevented him 
> from doing thorough benchmarking or fleshing it out into "production ready" code.
> I took Joel's original PoC and fleshed it out with unit tests, fixed some 
> bugs, and did some benchmarking against large indexes - the results look 
> really good.
> I've since then beefed the code up more to include collapsing on numeric 
> fields, and added support for all group head selector types, as well as 
> adding support for {{nullPolicy=expand}}.


