epotyom opened a new pull request, #13568:
URL: https://github.com/apache/lucene/pull/13568

   @Shradha26 and I are working on a new faceting implementation in the sandbox module. With it, we are proposing the following new features for Lucene’s faceting and aggregation capabilities:
   
   #### Compute multiple associated values for all field types
   
   For an index with documents that have a facet field `color` and a numeric field `popularity`, today we can associate the MAX or SUM of `popularity` with the `color` facets using the `TaxonomyFloatFacetAssociations` implementations. Each `Facets` implementation is limited to computing a single associated value per iteration over the matching documents. But sometimes we want to compute multiple values, e.g. SUM(popularity) and MAX(bm25_score). With the new sandbox faceting, we have native support for multiple associated values for each facet without having to iterate through the documents multiple times.
   
   Current faceting supports computing associated values for taxonomy fields only. With the new implementation, users will be able to compute associated values for any facet type.
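   To make the goal concrete, here is a purely illustrative sketch of accumulating two aggregations per facet ordinal in a single pass over the matching documents. All names below are hypothetical stand-ins, not the sandbox API:

```java
// Hypothetical sketch only; these types are not part of the sandbox API.
import java.util.HashMap;
import java.util.Map;

class MultiAggregationSketch {

  /** Per-ordinal accumulator for two aggregations computed in a single pass. */
  static final class Accumulator {
    long sumPopularity;
    double maxScore = Double.NEGATIVE_INFINITY;
  }

  private final Map<Integer, Accumulator> byOrdinal = new HashMap<>();

  /** Called once per (doc, facet ordinal) pair during collection. */
  void record(int facetOrdinal, long popularity, double score) {
    Accumulator acc = byOrdinal.computeIfAbsent(facetOrdinal, k -> new Accumulator());
    acc.sumPopularity += popularity;              // SUM(popularity)
    acc.maxScore = Math.max(acc.maxScore, score); // MAX(bm25_score)
  }
}
```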
   
   #### Computing facets during collection
   
   Today `FacetsCollector#collect` only collects doc IDs, and facet computation happens in each `Facets` instance in its own doc ID loop. Some per-document values may be expensive to compute (e.g. some relevance scores). To reuse these values when computing the different associated values for different facets, we have to cache them for all matching documents, which might also be expensive.
   
   The new implementation computes facets as it collects the doc IDs. It allows us to do the expensive computation just once per document, and then use that value for all the facets the document belongs to.
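   A rough sketch of this idea, using hypothetical types rather than the sandbox API: the expensive per-document value is computed once at collection time and handed to every recorder, instead of being cached for all docs and re-read later.

```java
// Hypothetical sketch only; not the sandbox API.
import java.util.List;

class CollectTimeFacetingSketch {

  /** Something that records an aggregation value for a document, e.g. per facet ordinal. */
  interface LeafRecorder {
    void record(int docId, double value);
  }

  /**
   * The expensive per-document value (e.g. a relevance score) is computed once,
   * at collection time, and shared with every recorder, rather than being
   * recomputed or cached for separate per-Facets doc ID loops.
   */
  static void collect(int docId, double expensiveValue, List<LeafRecorder> recorders) {
    for (LeafRecorder recorder : recorders) {
      recorder.record(docId, expensiveValue);
    }
  }
}
```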
   
   #### Decouple aggregation computation from the logic that decides which facets a document belongs to
   
   The current faceting implementation is responsible both for retrieving the different facets associated with a document and for storing aggregated values or counts across different facets. Because of this, we need a separate facet implementation for each type of aggregated value, e.g. `IntTaxonomyFacets`, `FloatTaxonomyFacets`, etc.
   
   In the proposed implementation, we have two new interfaces that help decouple the “faceting” and the “aggregation” logic - 
   
   1. `FacetCutter` - “cuts” a document into facets, yielding facet IDs/ordinals (int).
   2. `FacetRecorder` - for a given document ID and facet ID, records some data (a count, long aggregations, double aggregations, or anything else depending on the selected `FacetRecorder` implementation).
   
   As a result, any facet type (taxonomy, numeric ranges, etc.) can aggregate any kind of values (counts only, or any type and number of associated value aggregations).
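   The interface names above come from the PR, but the method shapes in the following sketch are only an illustration of the decoupling, not the actual sandbox signatures:

```java
// Illustrative shapes only; the actual sandbox interfaces may differ.
interface FacetCutterSketch {
  int NO_MORE_ORDS = -1;

  /** Positions the cutter on a document. */
  void startDoc(int docId);

  /** Returns the next facet ordinal for the current document, or NO_MORE_ORDS. */
  int nextOrd();
}

interface FacetRecorderSketch {
  /** Records data (a count, a long/double aggregation, ...) for a (doc, ordinal) pair. */
  void record(int docId, int facetOrdinal);
}

/** The collection loop only wires a cutter to a recorder. */
class FacetCollectorSketch {
  private final FacetCutterSketch cutter;
  private final FacetRecorderSketch recorder;

  FacetCollectorSketch(FacetCutterSketch cutter, FacetRecorderSketch recorder) {
    this.cutter = cutter;
    this.recorder = recorder;
  }

  void collect(int docId) {
    cutter.startDoc(docId);
    for (int ord = cutter.nextOrd(); ord != FacetCutterSketch.NO_MORE_ORDS; ord = cutter.nextOrd()) {
      recorder.record(docId, ord);
    }
  }
}
```

   Because the collection loop only connects a cutter to a recorder, new facet types and new aggregation types can be combined freely.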
   
   #### Unified, flexible facet results filtering and sorting
   
   Today, the `Facets#getTopChildren`, `getAllChildren`, `getTopDims`, etc. methods can be used to get facet results. This API has some limitations. One limitation is that each `Facets` implementation has slightly different logic. For example, most implementations sort results by count to get the top N, but the ones that compute associations sort by aggregated value (in descending order). They also often have different tie-break algorithms: some implementations tie-break by facet ordinal (e.g. `TaxonomyFacets`), some by facet value (e.g. `LongValueFacetCounts`), etc. To implement a different sort order, clients have to extend these classes and override these methods, which is not always convenient, e.g. some of these classes are package-private.
   
   We have introduced an `OrdinalIterator` interface, which has implementations that perform “get children”, “get top N”, and “get specific values” operations. Ordinals can be sorted by values collected by a `FacetRecorder`, by labels, etc. Clients can implement the `OrdinalIterator` interface to add custom filtering or sort orders.
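   As a purely illustrative example of this extension point (the real `OrdinalIterator` API may look different), a custom filter can be written as an iterator that wraps another one:

```java
// Illustrative only; the actual sandbox OrdinalIterator API may differ.
import java.util.function.IntToLongFunction;

interface OrdinalIteratorSketch {
  int NO_MORE_ORDS = -1;

  /** Returns the next facet ordinal, or NO_MORE_ORDS when exhausted. */
  int nextOrd();
}

/** Keeps only ordinals whose recorded count passes a threshold. */
class MinCountOrdinalIterator implements OrdinalIteratorSketch {
  private final OrdinalIteratorSketch in;
  private final IntToLongFunction countForOrd; // e.g. backed by a counting recorder
  private final long minCount;

  MinCountOrdinalIterator(OrdinalIteratorSketch in, IntToLongFunction countForOrd, long minCount) {
    this.in = in;
    this.countForOrd = countForOrd;
    this.minCount = minCount;
  }

  @Override
  public int nextOrd() {
    for (int ord = in.nextOrd(); ord != NO_MORE_ORDS; ord = in.nextOrd()) {
      if (countForOrd.applyAsLong(ord) >= minCount) {
        return ord;
      }
    }
    return NO_MORE_ORDS;
  }
}
```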
   
   #### DrillSideways supports any collectors
   
   For the sandbox module to work, we had to change `DrillSideways` to support any type of `Collector`, not just `FacetsCollector`. The challenge there is that different drill-sideways dimensions might want to use different `Collector` types and return different result types (`CollectorManager<C, T>`), and Java generics don't make it easy to handle an unknown number of generic types. In the PR this is solved by adding a `CollectorOwner` class, which keeps the collectors that its `CollectorManager` creates. As a result, `DrillSideways` doesn't have to manage collectors and results itself, which allows us to use the wildcard type `?` and let the client handle collector and result types.
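   The sketch below shows the general idea of such a wrapper; it is not the exact `CollectorOwner` API added in the PR:

```java
// Sketch of the idea behind a CollectorOwner-style wrapper; the class added in the PR may differ in detail.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.CollectorManager;

class CollectorOwnerSketch<C extends Collector, T> {
  private final CollectorManager<C, T> manager;
  // Synchronized because collectors may be created concurrently, one per searcher slice.
  private final List<C> collectors = Collections.synchronizedList(new ArrayList<>());

  CollectorOwnerSketch(CollectorManager<C, T> manager) {
    this.manager = manager;
  }

  /** DrillSideways only needs a plain Collector per slice, without knowing C or T. */
  Collector newCollector() throws IOException {
    C collector = manager.newCollector();
    collectors.add(collector);
    return collector;
  }

  /** The caller, which does know C and T, reduces the results itself. */
  T getResult() throws IOException {
    return manager.reduce(collectors);
  }
}
```

   `DrillSideways` can then hold these owners behind a wildcard type (e.g. `CollectorOwnerSketch<?, ?>`) and leave reducing the results to the caller, which knows the concrete `C` and `T`.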
   
   #### Latency Improvement
   
   Computing facets during collection does more total work, because we need to collect in each searcher slice and then merge the results in `CollectorManager#reduce`. At the same time, it reduces latency, because the initial collection is done in parallel and there is an opportunity to reuse values computed for a document across all the different collectors (drill sideways).
   
   Please see `SandboxFacetsExample` in the `demo` package for some usage 
examples.
   

