[GitHub] [lucene] Shradha26 opened a new issue, #12553: [DISCUSS] Identifying Gaps in Lucene’s Faceting

via GitHub Wed, 13 Sep 2023 08:58:28 -0700


Shradha26 opened a new issue, #12553:
URL: https://github.com/apache/lucene/issues/12553


   I’d like to gather a list of areas where Lucene’s support for aggregations 
can be improved and discuss if faceting can be augmented to offer that support 
or if it would need to be separate functionality. Please suggest more ideas or 
challenge the ones listed!
   
   ## Description
   
   Information retrieval platforms built on top of Lucene, such as Solr, 
Elasticsearch, and OpenSearch, have rich aggregation engines that are different 
from what Lucene natively supports. Lucene has some unique ideas to make 
aggregation computation efficient. Some examples are -
   
   * A side car taxonomy index that does some processing at index time to make 
search time computation of aggregations faster, but adds the hassle of 
maintaining another index, the risk of it becoming corrupt, etc.
   * Rolling up values from children to parents (post aggregation) to compute 
aggregations in some cases, which reduces the overall number of values computed 
per aggregation. 
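
   To make the roll-up idea concrete, here is a minimal sketch in plain Java. 
It does not use Lucene's classes. The `parents` array, the `rollUp` helper, and 
the assumption that every parent has a smaller ordinal than its children are 
all illustrative choices for this example, not a description of how Lucene's 
facet implementations are written.

```java
import java.util.Arrays;

final class RollUpSketch {
    /**
     * Rolls leaf counts up to their ancestors.
     * parents[ord] is the parent ordinal of ord; the root is ordinal 0 with parent -1.
     * Assumption for this sketch: every parent has a smaller ordinal than its children,
     * so a single pass from the highest ordinal down visits children before parents.
     */
    static int[] rollUp(int[] parents, int[] leafCounts) {
        int[] totals = Arrays.copyOf(leafCounts, leafCounts.length);
        for (int ord = totals.length - 1; ord > 0; ord--) {
            totals[parents[ord]] += totals[ord];
        }
        return totals;
    }

    public static void main(String[] args) {
        // Hypothetical taxonomy: 0 = root, 1 = British, 2 = Russian,
        // 3 = British/Austen, 4 = British/Orwell, 5 = Russian/Tolstoy
        int[] parents = {-1, 0, 0, 1, 1, 2};
        // Counts gathered only for the most specific ordinal of each document.
        int[] leafCounts = {0, 0, 0, 3, 2, 4};
        System.out.println(Arrays.toString(rollUp(parents, leafCounts)));
        // prints [9, 5, 4, 3, 2, 4]
    }
}
```

   Only the leaf ordinals are touched per document; the parent values are 
derived afterwards in one pass over the ordinals.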
   
   
   Here are some ideas [@stefanvodita](https://github.com/stefanvodita) and I 
encountered in our work and through exploration of what the other platforms 
support -
   
   #### New features
   
   * Support dynamic group definitions: In Solr, defining an aggregation group 
is as simple as providing an iterator over documents. In Lucene, we can’t make 
arbitrary group definitions to that extent.
   * Support adding data for groups (ordinals). Association Facet Fields can do 
this; but one problem is that there’s no single authority for the data. For 
example, if we have an index of books where the Author is a Facet Field on the 
book document and we want to store the Author’s popularity, with Association 
Facet Fields, we’ll need to denormalize this value once per book document. 
This makes it possible for one document to overwrite the intended value and 
leave the data inconsistent. For Taxonomy Facets, we could use the side car 
taxonomy index to efficiently add data about aggregation groups in normalized 
form: https://github.com/apache/lucene/pull/12337
   * Aggregation Expression Facets: Users may want to define expressions that 
reference other aggregation functions: 
https://github.com/apache/lucene/pull/12184. Note that this is different from 
the “Expression Facets” in the current Lucene demos - those are “document 
expression facets” and are expressions defined using fields on the document and 
do not use aggregations in the definition.
   * Nested aggregations: This is similar to the idea of Expression Facets in 
that they are an aggregation over other aggregations, but this time the parent 
aggregation references aggregations from lower levels in a hierarchy. For 
example: if we have an index of books with a hierarchical Facet Field for 
Author like <Nationality>/<Author>, we want to be able to answer queries like 
“What is the nationality of the author with the most books?” To do this, we’ll 
need to compute an aggregation A1 = count(books) per Author (level 1 in the 
hierarchy) and then do a max(A1) per Nationality (level 2 in the hierarchy).
   * [Cascading aggregation 
groups](https://github.com/apache/lucene/issues/4195): I think this feature 
corresponds to nested facets in Solr. A clothing store could have products 
categorized by size and color, as different dimensions. They might want to give 
customers a navigation experience that breaks down sizes by color. The customer 
might like blue, but see that the store is in short supply of blue items of the 
right size, and select a different color instead.
   * API to associate aggregation groups with the aggregation functions: For 
single valued hierarchical facet fields, Lucene uses roll-ups to compute 
aggregation across the different levels. However, for all other cases, it uses 
an approach I’ll refer to as jump-ups - for each document, we iterate through 
all possible unique groups (ordinals) relevant to the document (irrespective of 
their relationship via a hierarchy), and update an aggregation accumulator 
corresponding to the group (ordinals). (This is the values array indexed by 
ordinal in TaxonomyFacet implementations). Right now, even the jump-up approach 
ends up computing the aggregation function across all groups. However, it has 
the potential to compute exactly the number of aggregations needed. By 
associating aggregation groups with the aggregation functions users are 
interested in, Lucene can compute the exact number of aggregations needed. It 
could also be beneficial in cases where users are interested in a variety of 
aggregations for different groups.
   * Enable deep hierarchies: Very deep ordinal hierarchies can have trouble 
scaling. Storing every ordinal in the path up to a label on a doc may not be 
feasible. Computing aggregations would be slower than it has to be if they are 
only needed for portions of the hierarchy. Short-circuiting roll-ups or 
aggregating directly into targeted ordinals would improve performance (see idea 
above).
   * Make facets generic: Some implementations depend on underlying primitives, 
which makes them efficient, but it also makes it so any improvement to 
IntTaxonomyFacets has to be reimplemented for FloatTaxonomyFacets. Also, if a 
user wants to define a new aggregation (e.g. AVG), they have to write a new 
faceting implementation too, since the existing facets don’t have generic 
accumulators. An AVG accumulator would need to maintain two values. The 
existing accumulators can only store one value per ordinal.
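
   The nested aggregation example from the list above (A1 = count(books) per 
Author, then max(A1) per Nationality) could be sketched as follows. This is 
plain Java over in-memory strings, not Lucene code. The class name, the method 
names, and the `Nationality/Author` path strings are hypothetical and only 
serve to illustrate the two aggregation levels.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class NestedAggregationSketch {
    /** A1: count of books per author, grouped under the author's nationality. */
    static Map<String, Map<String, Integer>> countPerAuthor(List<String> bookPaths) {
        Map<String, Map<String, Integer>> counts = new TreeMap<>();
        for (String path : bookPaths) {
            String[] parts = path.split("/", 2); // parts[0] = nationality, parts[1] = author
            counts.computeIfAbsent(parts[0], k -> new TreeMap<>())
                  .merge(parts[1], 1, Integer::sum);
        }
        return counts;
    }

    /** max(A1) per nationality: the parent aggregation reads the child aggregation. */
    static Map<String, Integer> maxPerNationality(Map<String, Map<String, Integer>> a1) {
        Map<String, Integer> maxima = new TreeMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : a1.entrySet()) {
            int max = 0;
            for (int count : e.getValue().values()) {
                max = Math.max(max, count);
            }
            maxima.put(e.getKey(), max);
        }
        return maxima;
    }

    /** Nationality of the author with the most books (first in sorted order on ties). */
    static String nationalityOfTopAuthor(List<String> bookPaths) {
        String best = null;
        int bestCount = -1;
        for (Map.Entry<String, Integer> e : maxPerNationality(countPerAuthor(bookPaths)).entrySet()) {
            if (e.getValue() > bestCount) {
                bestCount = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // One entry per book document; the data is made up for the example.
        List<String> books = Arrays.asList(
            "British/Austen", "British/Austen", "British/Austen",
            "British/Orwell", "British/Orwell",
            "Russian/Tolstoy", "Russian/Tolstoy", "Russian/Tolstoy", "Russian/Tolstoy");
        System.out.println(nationalityOfTopAuthor(books)); // prints Russian
    }
}
```

   Note that max(A1) cannot be obtained by rolling up counts: rolling up would 
give the total number of books per Nationality (5 for British, 4 for Russian), 
which answers a different question.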
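
   As a rough illustration of the "generic accumulators" point, the sketch 
below shows an AVG accumulator that keeps two values per ordinal (a running sum 
and a count). The `OrdinalAccumulator` interface and both implementations are 
hypothetical names invented for this example; nothing like them exists in 
Lucene today, and a real design would need to consider primitive 
specialization and memory layout.

```java
final class AvgAccumulatorSketch {
    /** Hypothetical per-ordinal accumulator abstraction (not a Lucene API). */
    interface OrdinalAccumulator {
        void accumulate(int ordinal, double value);
        double result(int ordinal);
    }

    /** MAX needs a single value per ordinal, like the existing values array. */
    static final class MaxAccumulator implements OrdinalAccumulator {
        private final double[] maxima;

        MaxAccumulator(int ordinalCount) {
            maxima = new double[ordinalCount];
            java.util.Arrays.fill(maxima, Double.NEGATIVE_INFINITY);
        }

        @Override
        public void accumulate(int ordinal, double value) {
            maxima[ordinal] = Math.max(maxima[ordinal], value);
        }

        @Override
        public double result(int ordinal) {
            return maxima[ordinal];
        }
    }

    /** AVG needs two values per ordinal: the running sum and the number of values seen. */
    static final class AvgAccumulator implements OrdinalAccumulator {
        private final double[] sums;
        private final int[] counts;

        AvgAccumulator(int ordinalCount) {
            sums = new double[ordinalCount];
            counts = new int[ordinalCount];
        }

        @Override
        public void accumulate(int ordinal, double value) {
            sums[ordinal] += value;
            counts[ordinal]++;
        }

        @Override
        public double result(int ordinal) {
            // NaN signals that no value was accumulated for this ordinal.
            return counts[ordinal] == 0 ? Double.NaN : sums[ordinal] / counts[ordinal];
        }
    }

    public static void main(String[] args) {
        OrdinalAccumulator avg = new AvgAccumulator(3);
        avg.accumulate(1, 4.0);
        avg.accumulate(1, 2.0);
        System.out.println(avg.result(1)); // prints 3.0
    }
}
```

   With an abstraction along these lines, the code that walks documents and 
ordinals would be written once, and a new aggregation would only need a new 
accumulator.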


