gsmiller opened a new issue, #15017: URL: https://github.com/apache/lucene/issues/15017
### Description Folks- I wanted to share some thoughts on a faceting improvement I've been considering for early feedback. As far as I can tell, this is an idea that hasn't been considered in the past, but it's possible I didn't uncover a previous conversation in my digging. #### Background / Context Faceting is commonly used for computing result counts associated with various filtering options. Sometimes these counts are directly shown to users, while sometimes they are used "under-the-hood" for validation. We do this in Amazon's product search engine, looking for filter options that would lead to a zero result count so we know to hide or disable it (in hopes of not leading users to empty result experiences). Faceting only computes counts "one step into the future;" counts must be re-computed each time a filter is applied to reflect the new state. For example, imagine an e-commerce site selling electronics with a user doing a search for "tv". Maybe there are non-zero results for an "8K resolution" filter option as well as a "32 inch screen size" option, but none of those results overlap. After the initial query, the user could successfully apply either of these filters independently, but the combination would lead to empty results. To guard against the empty results experience, facet counts must be re-computed once the user applies one of the filters to understand that the other would now have a zero count. (Sorry, this is probably all obvious, but trying to provide context for folks that don't use faceting as often). Since facet counting is a relatively light-weight operation, and the search must be re-issued each time a filter is applied, this is generally acceptable (i.e., just re-d o faceting every time you re-do the search... no big deal). The downside of this is that it couples refreshing filter elements in the UX with refreshing the search results themselves. If a user wants to interact with multiple filters, they have to wait for a "full refresh" between each action, which can feel pretty cumbersome. #### The Idea The faceting problem is really a set algebra problem. Each filter option is associated with some subset of the results, and we really want to know the cardinality after applying intersections and unions of additional filters. In our above example, imagine the user has selected the "8K resolution" option. To refresh our counts on the "32 inch screen size" option, we're logically doing an intersection of two sets (one associated with each option) and computing the cardinality of the resulting set. In our use-cases, we're happy to have an estimate of this cardinality; I suspect this is true for plenty of other use-cases as well. There are specialized data structures (distinct value sketches) for computing set cardinality estimates. What if we extended faceting to support populating one of these data structures for each unique filter value? Each data structure can estimate its own cardinality (not useful on its own since we already have counts)... but it gets better. These data structures hold up to set operations, allowing for the cardinality of arbitrary intersections and unions to also be computed. The idea here would be to compute these data structures on the initial query and make them available to the search application. Then, as filters get selected/de-selected, these independent data structures can be used in the application to re-estimate counts for the inactive filters without needing to re-compute faceting. As a result, we could build experiences that don't need to block on a new search call to refresh filtering options. Apache Datasketches has [a couple options](https://datasketches.apache.org/docs/Architecture/MajorSketchFamilies.html) that I think could be applicable. Specifically, theta sketches (an implementation of k-minimum values) is the most full-featured since it supports both set intersections and unions natively. HyperLogLog / CPC sketches also look really interesting. They don't natively support intersections (unions only) but are much more compact. The [inclusion-exclusion principle](https://en.wikipedia.org/wiki/Inclusion–exclusion_principle) could be used to estimate intersection cardinality (but there are some downsides as errors accumulate... it may not be applicable in all cases). #### Next Steps / Feedback I'm looking for feedback on a few things here: 1. Do folks have opinions on extending faceting functionality in some way to populate one of these sketch families as described? This might be an extension of faceting itself, or it might be some different abstraction. I don't have strong opinions here yet. 2. Do folks have opinions on the best sketch choice (or opinions on making a couple available for users to select from based on their needs... i.e., accuracy vs. data size)? Or maybe other options I haven't looked at that could solve this problem? 3. Is there any prior work in this space within Lucene or search engines in general that anyone is aware of? I haven't seen anything myself, but maybe there's something else to draw on for inspiration? If folks are open to this idea, I'll try to make time to play around with it a bit more. (Or happy to collaborate with someone else if interested!) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org