[I] Faceting + Data Sketches [lucene]

via GitHub Thu, 31 Jul 2025 08:01:55 -0700


gsmiller opened a new issue, #15017:
URL: https://github.com/apache/lucene/issues/15017


   ### Description
   
   Folks- I wanted to share some thoughts on a faceting improvement I've been 
considering for early feedback. As far as I can tell, this is an idea that 
hasn't been considered in the past, but it's possible I didn't uncover a 
previous conversation in my digging.
   
   #### Background / Context
   Faceting is commonly used for computing result counts associated with 
various filtering options. Sometimes these counts are directly shown to users, 
while sometimes they are used "under-the-hood" for validation. We do this in 
Amazon's product search engine, looking for filter options that would lead to a 
zero result count so we know to hide or disable it (in hopes of not leading 
users to empty result experiences).
   
   Faceting only computes counts "one step into the future;" counts must be 
re-computed each time a filter is applied to reflect the new state. For 
example, imagine an e-commerce site selling electronics with a user doing a 
search for "tv". Maybe there are non-zero results for an "8K resolution" filter 
option as well as a "32 inch screen size" option, but none of those results 
overlap. After the initial query, the user could successfully apply either of 
these filters independently, but the combination would lead to empty results. 
To guard against the empty results experience, facet counts must be re-computed 
once the user applies one of the filters to understand that the other would now 
have a zero count. (Sorry, this is probably all obvious, but trying to provide 
context for folks that don't use faceting as often). Since facet counting is a 
relatively light-weight operation, and the search must be re-issued each time a 
filter is applied, this is generally acceptable (i.e., just re-d
 o faceting every time you re-do the search... no big deal). The downside of 
this is that it couples refreshing filter elements in the UX with refreshing 
the search results themselves. If a user wants to interact with multiple 
filters, they have to wait for a "full refresh" between each action, which can 
feel pretty cumbersome.
   
   #### The Idea
   The faceting problem is really a set algebra problem. Each filter option is 
associated with some subset of the results, and we really want to know the 
cardinality after applying intersections and unions of additional filters. In 
our above example, imagine the user has selected the "8K resolution" option. To 
refresh our counts on the "32 inch screen size" option, we're logically doing 
an intersection of two sets (one associated with each option) and computing the 
cardinality of the resulting set. In our use-cases, we're happy to have an 
estimate of this cardinality; I suspect this is true for plenty of other 
use-cases as well.
   
   There are specialized data structures (distinct value sketches) for 
computing set cardinality estimates. What if we extended faceting to support 
populating one of these data structures for each unique filter value? Each data 
structure can estimate its own cardinality (not useful on its own since we 
already have counts)... but it gets better. These data structures hold up to 
set operations, allowing for the cardinality of arbitrary intersections and 
unions to also be computed. The idea here would be to compute these data 
structures on the initial query and make them available to the search 
application. Then, as filters get selected/de-selected, these independent data 
structures can be used in the application to re-estimate counts for the 
inactive filters without needing to re-compute faceting. As a result, we could 
build experiences that don't need to block on a new search call to refresh 
filtering options.
   
   Apache Datasketches has [a couple 
options](https://datasketches.apache.org/docs/Architecture/MajorSketchFamilies.html)
 that I think could be applicable. Specifically, theta sketches (an 
implementation of k-minimum values) is the most full-featured since it supports 
both set intersections and unions natively. HyperLogLog / CPC sketches also 
look really interesting. They don't natively support intersections (unions 
only) but are much more compact. The [inclusion-exclusion 
principle](https://en.wikipedia.org/wiki/Inclusion–exclusion_principle) could 
be used to estimate intersection cardinality (but there are some downsides as 
errors accumulate... it may not be applicable in all cases). 
   
   #### Next Steps / Feedback
   I'm looking for feedback on a few things here:
   1. Do folks have opinions on extending faceting functionality in some way to 
populate one of these sketch families as described? This might be an extension 
of faceting itself, or it might be some different abstraction. I don't have 
strong opinions here yet.
   2. Do folks have opinions on the best sketch choice (or opinions on making a 
couple available for users to select from based on their needs... i.e., 
accuracy vs. data size)? Or maybe other options I haven't looked at that could 
solve this problem?
   3. Is there any prior work in this space within Lucene or search engines in 
general that anyone is aware of? I haven't seen anything myself, but maybe 
there's something else to draw on for inspiration?
   
   If folks are open to this idea, I'll try to make time to play around with it 
a bit more. (Or happy to collaborate with someone else if interested!)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[I] Faceting + Data Sketches [lucene]

Reply via email to