Shradha26 opened a new issue, #12553: URL: https://github.com/apache/lucene/issues/12553
I’d like to gather a list of areas where Lucene’s support for aggregations can be improved and discuss whether faceting can be augmented to offer that support or whether it would need to be separate functionality. Please suggest more ideas or challenge the ones listed!

## Description

Information retrieval platforms built on top of Lucene, like Solr, Elasticsearch, and OpenSearch, have rich aggregation engines that differ from what Lucene natively supports. Lucene has some unique ideas for making aggregation computation efficient. Some examples are:

* A side-car taxonomy index that does some processing at index time to make search-time computation of aggregations faster, but adds the hassle of maintaining another index, the risk of it becoming corrupt, etc.
* Rolling up values from children to parents (post-aggregation) to compute aggregations in some cases, reducing the overall number of values computed per aggregation.

Here are some ideas [@stefanvodita](https://github.com/stefanvodita) and I encountered in our work and through exploration of what the other platforms support:

#### New features

* Support dynamic group definitions: In Solr, defining an aggregation group is as simple as providing an iterator over documents. In Lucene, we can’t make arbitrary group definitions to that extent.
* Support adding data for groups (ordinals): Association Facet Fields can do this, but one problem is that there’s no single authority for the data. For example, if we have an index of books where the Author is a Facet Field on the book document and we want to store the Author’s popularity, with Association Facet Fields we’ll need to denormalize this value once per book document. This introduces the possibility of some document changing or overwriting the intended value, leaving the data inconsistent.
  For Taxonomy Facets, we could use the side-car taxonomy index to efficiently add data about aggregation groups in normalized form: https://github.com/apache/lucene/pull/12337
* Aggregation Expression Facets: Users may want to define expressions that reference other aggregation functions: https://github.com/apache/lucene/pull/12184. Note that this is different from the “Expression Facets” in the current Lucene demos. Those are “document expression facets”: expressions defined over fields of the document, which do not use aggregations in their definition.
* Nested aggregations: This is similar to the idea of Aggregation Expression Facets in that they are an aggregation over other aggregations, but this time the parent aggregation references aggregations from lower levels in a hierarchy. For example, if we have an index of books with a hierarchical Facet Field for Author like `<Nationality>/<Author>`, we want to be able to answer queries like “What is the nationality of the author with the most books?” To do this, we’ll need to compute an aggregation A1 = count(books) per Author (level 1 in the hierarchy) and then do a max(A1) per Nationality (level 2 in the hierarchy).
* [Cascading aggregation groups](https://github.com/apache/lucene/issues/4195): I think this feature corresponds to nested facets in Solr. A clothing store could have products categorized by size and color, as different dimensions. They might want to give customers a navigation experience that breaks down sizes by color. The customer might like blue, but see that the store is in short supply of blue items of the right size, and select a different color instead.
* API to associate aggregation groups with the aggregation functions: For single-valued hierarchical facet fields, Lucene uses roll-ups to compute aggregations across the different levels.
  However, for all other cases, it uses an approach I’ll refer to as jump-ups: for each document, we iterate through all unique groups (ordinals) relevant to the document, irrespective of their relationship via a hierarchy, and update the aggregation accumulator corresponding to each group (ordinal). (This is the values array indexed by ordinal in TaxonomyFacet implementations.) Right now, even the jump-up approach ends up computing the aggregation function across all groups. However, it has the potential to compute exactly the aggregations needed. By associating aggregation groups with the aggregation functions users are interested in, Lucene can compute only those aggregations. This could also be beneficial in cases where users are interested in a variety of aggregations for different groups.
* Enable deep hierarchies: Very deep ordinal hierarchies can have trouble scaling. Storing every ordinal in the path up to a label on a doc may not be feasible. Computing aggregations would be slower than it has to be if they are only needed for portions of the hierarchy. Short-circuiting roll-ups or aggregating directly into targeted ordinals would improve performance (see the previous idea).
* Make facets generic: Some implementations depend on underlying primitives, which makes them efficient, but it also means that any improvement to IntTaxonomyFacets has to be reimplemented for FloatTaxonomyFacets. Also, if a user wants to define a new aggregation (e.g. AVG), they have to write a new faceting implementation too, since the existing facets don’t have generic accumulators. An AVG accumulator would need to maintain two values, but the existing accumulators can only store one value per ordinal.
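To make the roll-up versus jump-up distinction from the "API to associate aggregation groups" idea concrete, here is a minimal sketch in plain Python. It is not Lucene code: the toy taxonomy, the `PARENT` array, and the function names are all hypothetical, and walking the parent chain stands in for "all ordinals relevant to the document".

```python
# Hypothetical toy taxonomy; ordinals and labels are made up for illustration.
# PARENT[ord] is the parent ordinal, -1 for the root.
LABELS = ["root", "US", "UK", "US/Twain", "US/Poe", "UK/Austen"]
PARENT = [-1, 0, 0, 1, 1, 2]


def path_to_root(ordinal):
    """Return the ordinal followed by all of its ancestors."""
    path = []
    while ordinal != -1:
        path.append(ordinal)
        ordinal = PARENT[ordinal]
    return path


def rollup_counts(doc_leaf_ordinals):
    """Roll-up: count at the leaf only, then add each child's total to its parent."""
    values = [0] * len(PARENT)
    for leaf in doc_leaf_ordinals:
        values[leaf] += 1
    # In this toy taxonomy children always have larger ordinals than their
    # parents, so a reverse pass finishes every child before its parent.
    for ordinal in range(len(PARENT) - 1, 0, -1):
        values[PARENT[ordinal]] += values[ordinal]
    return values


def jumpup_counts(doc_leaf_ordinals, wanted=None):
    """Jump-up: update every ordinal on each document's path.

    If `wanted` is given, only those ordinals get an accumulator, which is
    the "compute exactly the aggregations needed" variant.
    """
    values = {}
    for leaf in doc_leaf_ordinals:
        for ordinal in path_to_root(leaf):
            if wanted is None or ordinal in wanted:
                values[ordinal] = values.get(ordinal, 0) + 1
    return values


docs = [3, 3, 4, 5]  # one leaf ordinal per matching document
all_counts = rollup_counts(docs)
us_only = jumpup_counts(docs, wanted={1})
```

Both strategies agree on the counts for every ordinal; the targeted jump-up only ever allocates and updates the accumulator for `US`, which is the saving the proposed API would aim for.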
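The "Nested aggregations" example (count books per Author, then take the max per Nationality) can be sketched as follows. This is only an illustration in plain Python under assumed data: the `Nationality/Author` path strings and the function name are hypothetical, and nothing here reflects how Lucene would implement it.

```python
from collections import Counter


def top_author_per_nationality(book_paths):
    """book_paths: one 'Nationality/Author' string per book (hypothetical data)."""
    # A1 = count(books) per Author (lower level of the hierarchy).
    books_per_author = Counter(book_paths)
    # max(A1) per Nationality, remembering which author achieved it.
    best = {}
    for path, count in books_per_author.items():
        nationality, author = path.split("/", 1)
        if nationality not in best or count > best[nationality][1]:
            best[nationality] = (author, count)
    return best


books = ["US/Twain", "US/Twain", "US/Poe", "UK/Austen", "UK/Austen", "UK/Austen"]
best = top_author_per_nationality(books)
# "What is the nationality of the author with the most books?"
top_nationality = max(best, key=lambda n: best[n][1])
```

The point of the sketch is that the outer aggregation reads the results of the inner one rather than the documents themselves, which is what distinguishes it from a plain per-group count.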
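For the "Make facets generic" idea, one possible shape for a generic accumulator is sketched below in plain Python. The class and function names are hypothetical and are not part of Lucene; the sketch only illustrates why AVG needs two values per ordinal while SUM needs one, and how a single aggregation loop could serve both.

```python
class SumAccumulator:
    """One value per ordinal, like a plain values array indexed by ordinal."""

    def __init__(self, num_ordinals):
        self.sums = [0.0] * num_ordinals

    def add(self, ordinal, value):
        self.sums[ordinal] += value

    def result(self, ordinal):
        return self.sums[ordinal]


class AvgAccumulator:
    """Two values per ordinal: a running sum and a count."""

    def __init__(self, num_ordinals):
        self.sums = [0.0] * num_ordinals
        self.counts = [0] * num_ordinals

    def add(self, ordinal, value):
        self.sums[ordinal] += value
        self.counts[ordinal] += 1

    def result(self, ordinal):
        if self.counts[ordinal] == 0:
            return None  # no documents contributed to this ordinal
        return self.sums[ordinal] / self.counts[ordinal]


def aggregate(accumulator, hits):
    """hits: iterable of (ordinal, value) pairs; the loop is accumulator-agnostic."""
    for ordinal, value in hits:
        accumulator.add(ordinal, value)
    return accumulator


hits = [(0, 10.0), (0, 20.0), (1, 5.0)]
avg = aggregate(AvgAccumulator(3), hits)
total = aggregate(SumAccumulator(3), hits)
```

A design like this trades some of the efficiency of primitive arrays for reuse, so whether it is acceptable would depend on benchmarks.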