Re: [I] Faceting + Data Sketches [lucene]

via GitHub Sun, 03 Aug 2025 01:16:37 -0700


jpountz commented on issue #15017:
URL: https://github.com/apache/lucene/issues/15017#issuecomment-3148203349

> Since facet counting is a relatively light-weight operation

This statement surprised me a bit, since faceting tasks on the nightly
benchmarks run several times slower than top-k hits tasks (e.g.
[AndHighHighDayTaxoFacets](https://benchmarks.mikemccandless.com/AndHighHighDayTaxoFacets.html)
vs. [AndHighHigh](https://benchmarks.mikemccandless.com/AndHighHigh.html)).
It's too bad these tasks don't use the exact same queries, but I suspect that
it's still true that if you remove facets from the faceting tasks and only
compute top hits, then the task runs much faster.

> Is there any prior work in this space within Lucene or search engines in
general that anyone is aware of? I haven't seen anything myself, but maybe
there's something else to draw on for inspiration?

The problem of finding values of a (low-cardinality) field that intersect
with the query is a good candidate for dynamic filtering. For instance, say you
want to find all categories that have at least one hit for the user's query.
You could start with a disjunction over all categories, and use a collector
that removes a category from the disjunction as soon as it visits a hit of
this category. Elasticsearch does this to speed up its cardinality aggregation:
if you're counting the number of unique values of a field, once you've seen a
value, there is no need to see the same value again. This is much faster than
fully evaluating the query if the field has a relatively low cardinality.
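
To make the idea concrete, here is a minimal plain-Java simulation of it. It does not use any Lucene API: the class name `DynamicFacetFilter`, the per-category postings arrays, and the `queryMatches` predicate are all hypothetical stand-ins for a real disjunction query and collector.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.function.IntPredicate;

// Hypothetical sketch, not Lucene code: simulates a disjunction over
// categories whose clauses are dropped as soon as they produce a hit.
final class DynamicFacetFilter {

  /**
   * @param postings     category -> sorted doc IDs that have this category
   * @param queryMatches stand-in for "does the user's query match this doc?"
   * @return categories with at least one matching doc
   */
  static Set<String> categoriesWithHits(
      Map<String, int[]> postings, IntPredicate queryMatches) {
    // Remaining clauses of the disjunction: category -> cursor into its postings.
    Map<String, Integer> cursors = new HashMap<>();
    for (String category : postings.keySet()) {
      cursors.put(category, 0);
    }
    Set<String> found = new TreeSet<>();

    while (!cursors.isEmpty()) {
      // Pick the smallest current doc ID across the remaining clauses.
      String best = null;
      int bestDoc = Integer.MAX_VALUE;
      Iterator<Map.Entry<String, Integer>> it = cursors.entrySet().iterator();
      while (it.hasNext()) {
        Map.Entry<String, Integer> e = it.next();
        int[] docs = postings.get(e.getKey());
        if (e.getValue() >= docs.length) {
          it.remove(); // clause exhausted without a hit
        } else if (docs[e.getValue()] < bestDoc) {
          bestDoc = docs[e.getValue()];
          best = e.getKey();
        }
      }
      if (best == null) {
        break;
      }
      if (queryMatches.test(bestDoc)) {
        // First hit for this category: record it and drop the clause, so the
        // rest of its postings are never visited.
        found.add(best);
        cursors.remove(best);
      } else {
        cursors.merge(best, 1, Integer::sum);
      }
    }
    return found;
  }
}
```

The point of the sketch is the `cursors.remove(best)` step: once a category has been seen, its remaining documents stop contributing to the work, which is where the savings on low-cardinality fields would come from.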

Tencent has done some faceting optimizations using Lucene by taking
advantage of index sorting:
https://dl.acm.org/doi/abs/10.14778/3554821.3554837. In their case they're
trying to optimize histogram facets, but I think that your problem could take
advantage of index sorting as well. For instance, imagine if your index was
sorted by category then brand. You could then statically map (category, brand)
pairs to (per-segment) ranges of doc IDs that have these values. Then you can
check whether (category, brand) pairs co-exist within the matches of a query
only by checking if the query has matches in the corresponding ranges of doc
IDs. Again, this is a good candidate for dynamic filtering since you only need
to evaluate one doc ID per range. You may not be able to handle all your facet
fields this way, but I could imagine a hybrid solution where the main
dimensions used for faceting are used to configure the index sort, and other
facet fields use a different solution like the above paragraph or your proposal.
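
As a rough illustration of the range check, here is a plain-Java sketch under a few assumptions: the segment is sorted by category then brand, each (category, brand) pair has already been mapped to a half-open doc ID range, and the query's matches are available as a sorted array. The names `SortedRangeFacets`, `advance`, and `pairsWithHits` are hypothetical and not part of Lucene.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Lucene code: checks which (category, brand) pairs
// co-exist with a query by probing one doc ID per pre-computed range.
final class SortedRangeFacets {

  // Sentinel returned when no match is left at or after the target.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Returns the first matching doc ID >= target, or NO_MORE_DOCS. */
  static int advance(int[] sortedMatches, int target) {
    int i = Arrays.binarySearch(sortedMatches, target);
    if (i < 0) {
      i = -i - 1; // insertion point = index of first element > target
    }
    return i < sortedMatches.length ? sortedMatches[i] : NO_MORE_DOCS;
  }

  /**
   * @param ranges        pair label -> {start, end} doc ID range, end exclusive
   * @param sortedMatches doc IDs matching the query, in increasing order
   * @return pair labels whose range contains at least one match
   */
  static List<String> pairsWithHits(Map<String, int[]> ranges, int[] sortedMatches) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, int[]> e : ranges.entrySet()) {
      int start = e.getValue()[0];
      int end = e.getValue()[1];
      // One probe per range: the pair has a hit if the first match at or
      // after the range start still falls inside the range.
      if (advance(sortedMatches, start) < end) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```

In this sketch the cost is one `advance` call per range rather than one visit per matching document, which is the property that makes the index-sorted layout attractive here.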

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

