[PR] Reduce duplication in taxonomy facets; always do counts [lucene]

via GitHub Fri, 22 Dec 2023 02:14:14 -0800


stefanvodita opened a new pull request, #12966:
URL: https://github.com/apache/lucene/pull/12966


   ### Note
   
   This is a large change, refactoring most of the taxonomy facets code and 
changing internal behavior, without changing the API. There are specific API 
changes this sets us up to do later, e.g. retrieving counts from aggregation 
facets.
   
   ### What does this PR do well?
   
   1. Moves most of the responsibility from `TaxonomyFacets` implementations to 
`TaxonomyFacets` itself. This reduces code duplication and enables future 
development. Addresses genericity issue mentioned in #12553.
   2. As a consequence, it introduces sparse values to `FloatTaxonomyFacets`, 
which previously used dense values always. This issue is part of #12576.
   3. It computes counts for all taxonomy facets always, which enables us to 
add an API to retrieve counts for association facets in the future. Addresses 
#11282.
   4. As a consequence of having counts, we can check whether we encountered a 
label while faceting (`count > 0`), while previously we relied on the 
aggregation value to be positive. Closes #12585.
   5. It introduces the idea of doing multiple aggregations in one go, with 
association facets doing the aggregation they were already doing, plus a count. 
We can extend to an arbitrary number of aggregations, as suggested in #12546.
   6. It doesn't change the API. The only change in behavior users should 
notice is the fix for non-positive aggregation values, which were previously 
discarded.
   7. It adds tests which were missing for sparse/dense values and non-positive 
aggregations.
   
   ### What's not ideal about this approach?
   1. We could see some performance decreases. The more critical part of the 
work, aggregating, should be unaffected. There are a few extra method calls / 
dispatches / branches. Ranking and collecting results might be impacted because 
we are boxing / unboxing results to / from `Number` to avoid the primitive 
types.
   2. The way the `TopOrdAndNumberQueue`s work is a bit awkward and 
inefficient. It required small changes to classes outside the scope of this 
change. Maybe we can come up with something better.
   
   ## What is next?
   1. I'd like to know if the approach makes sense to others.
   2. We can try running some benchmarks to see if there are any performance 
changes.
   3. Is it important to preserve a default aggregation value of the right type 
in the results (i.e. `-1` for int aggregations, `-1f` for float aggregations)? 
If not, we can make a small simplification to always return `-1`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[PR] Reduce duplication in taxonomy facets; always do counts [lucene]

Reply via email to