epotyom opened a new pull request, #13568: URL: https://github.com/apache/lucene/pull/13568
@Shradha26 and I are working on a new faceting implementation in the sandbox module. With this, we are proposing the following new features for Lucene’s faceting and aggregation capabilities:

#### Compute multiple associated values for all field types

For an index with documents that have a facet field `color` and a numeric field `popularity`, today we can associate the MAX or SUM of `popularity` with the facets of `color` using the `TaxonomyFloatFacetAssociations` implementation. We are limited to computing only one associated value per faceting implementation in a single iteration. But sometimes we want to compute multiple values, e.g. SUM(popularity) and MAX(bm25_score). With the new sandbox faceting, we have native support for multiple associated values per facet without having to iterate through the documents multiple times. Current faceting supports computing associated values for taxonomy fields only; with the new implementation, users will be able to compute associated values for any faceting implementation.

#### Computing facets during collection

Today `FacetsCollector#collect` only collects docIDs, and facet computation happens in each `Facets` instance in its own doc ID loop. Some numeric field values may be expensive to compute (e.g. some relevance scores). To reuse these field values when computing the different associated values for different facets, we currently have to cache values for all docs, which might also be expensive. The new implementation computes facets as it collects the docIDs. It allows us to do the expensive computation just once per document, and then use that value for all the facets the document belongs to.

#### Decouple aggregation computation from the logic that decides which facets a document belongs to

The current faceting implementation is responsible both for retrieving the different facets associated with a document and for storing aggregated values or counts across those facets. Due to this, we have different facet implementations to handle each type of aggregated value, e.g. `IntTaxonomyFacets`, `FloatTaxonomyFacets`, etc. In the proposed implementation, we have two new interfaces that help decouple the “faceting” and the “aggregation” logic:

1. `FacetCutter` - “cuts” a document into facets, yielding facet IDs/ordinals (ints).
2. `FacetRecorder` - for a given document ID and facet ordinal, records some data (a count, long aggregations, double aggregations, or anything else depending on the selected `FacetRecorder` implementation).

As a result, any facet type (taxonomy, numeric ranges, etc.) can aggregate any kind of value (counts only, or any type and number of associated value aggregations).
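To make the split concrete, here is a minimal, illustrative sketch of how a cutter and a recorder could fit together during collection. The method names and signatures below (`getOrdinals`, `record`, the `CountRecorder` class) are simplified assumptions for illustration, not the exact interfaces added in this PR:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: simplified, hypothetical signatures, not the exact sandbox API.

// Decides which facet ordinals a (segment-local) doc belongs to.
interface FacetCutter {
  int[] getOrdinals(int docId) throws IOException;
}

// Records an aggregation contribution for one facet ordinal of one doc.
interface FacetRecorder {
  void record(int docId, int facetOrdinal) throws IOException;
}

// A count-only recorder; the same cutter could instead be paired with recorders
// that sum longs, track max doubles, or keep several aggregations at once.
class CountRecorder implements FacetRecorder {
  final Map<Integer, Integer> counts = new HashMap<>();

  @Override
  public void record(int docId, int facetOrdinal) {
    counts.merge(facetOrdinal, 1, Integer::sum);
  }
}

// During collection each hit is "cut" into ordinals, and each ordinal is recorded.
class FacetingCollector {
  void collect(int docId, FacetCutter cutter, FacetRecorder recorder) throws IOException {
    for (int ordinal : cutter.getOrdinals(docId)) {
      recorder.record(docId, ordinal);
    }
  }
}
```

The point of the split is that the cutter knows nothing about what is being aggregated and the recorder knows nothing about where the ordinals come from, which is what lets any facet type be combined with any aggregation.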
#### Unified, flexible facet results filtering and sorting

Today, the `Facets#getTopChildren`, `getAllChildren`, `getTopDims`, etc. methods can be used to get facet results. This API has some limitations. One limitation is that each `Facets` implementation has slightly different logic. For example, most implementations sort results by count to get the top N, but the ones that compute associations sort by aggregated value (in descending order). They also often have different tie-break algorithms: some implementations tie-break by facet ordinal (e.g. `TaxonomyFacets`), some by facet value (e.g. `LongValueFacetCounts`), etc. To implement a different sort order, clients have to extend these classes and override these methods, which is not always convenient - e.g. some of these classes are package private.

We have introduced an `OrdinalIterator` interface, which has implementations to perform “get children”, “get top N”, and “get specific values” operations. Ordinals can be sorted by values collected by a `FacetRecorder`, by labels, etc. Clients can extend the `OrdinalIterator` interface to implement custom filtering or sort orders.

#### DrillSideways supports any collectors

For the sandbox module to work, we had to change `DrillSideways` to support any type of Collector, not just `FacetsCollector`. The challenge there is that different drill sideways dimensions might want to use different Collector types and return different result types (`CollectorManager<C, T>`), and Java generics don't make it easy to handle an unknown number of generic types. In the PR this is solved by adding a `CollectorOwner` class which keeps the collectors that a `CollectorManager` creates. As a result, `DrillSideways` doesn't have to manage collectors and results itself, which allows us to use the wildcard type `?` and let the client handle collector and result types. See the sketch at the end of this description for the general idea.

#### Latency Improvement

Computing facets during collection is more expensive overall, because we need to collect in each searcher slice and then merge results in `CollectorManager#reduce`. At the same time, it reduces latency, because the initial collection is done in parallel and values computed for a document can be reused across all the collectors that see that document (e.g. for drill sideways).

Please see `SandboxFacetsExample` in the `demo` package for some usage examples.
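As a rough sketch of the `CollectorOwner` idea under assumed, simplified signatures (the real class added in this PR may look different): the owner keeps a `CollectorManager` and the collectors it creates together, so code like `DrillSideways` can operate on a list of `CollectorOwner<?, ?>` while each caller still gets its typed result back.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.lucene.search.Collector;
import org.apache.lucene.search.CollectorManager;

// Illustrative sketch only: a hypothetical, simplified version of the idea, not the PR's exact class.
final class CollectorOwner<C extends Collector, T> {
  private final CollectorManager<C, T> manager;
  private final List<C> collectors = new ArrayList<>();

  CollectorOwner(CollectorManager<C, T> manager) {
    this.manager = manager;
  }

  // Callers that only see CollectorOwner<?, ?> still get a plain Collector back;
  // the concrete type C stays hidden inside the owner.
  Collector newCollector() throws IOException {
    C collector = manager.newCollector();
    collectors.add(collector);
    return collector;
  }

  // Only the code that created this owner knows T and reads the reduced result.
  T getResult() throws IOException {
    return manager.reduce(collectors);
  }
}
```

With one such owner per drill sideways dimension, the search code can ask each owner for a new `Collector` through the wildcard type without knowing `C` or `T`, and each client reads its own reduced result afterwards.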