jpountz commented on issue #15017: URL: https://github.com/apache/lucene/issues/15017#issuecomment-3148203349
> Since facet counting is a relatively light-weight operation

This statement surprised me a bit, since faceting tasks on nightly benchmarks run several times slower than top-k hits tasks (e.g. [AndHighHighDayTaxoFacets](https://benchmarks.mikemccandless.com/AndHighHighDayTaxoFacets.html) vs. [AndHighHigh](https://benchmarks.mikemccandless.com/AndHighHigh.html)). It's too bad these tasks don't use the exact same queries, but I suspect it's still true that if you remove facets from the faceting tasks and only compute top hits, the task runs much faster.

> Is there any prior work in this space within Lucene or search engines in general that anyone is aware of? I haven't seen anything myself, but maybe there's something else to draw on for inspiration?

The problem of finding values of a (low-cardinality) field that intersect with the query is a good candidate for dynamic filtering. For instance, say you want to find all categories that have at least one hit for the user's query. You could start with a collector that exposes a disjunction over all categories, and that removes a category from the disjunction when it visits a hit of this category. Elasticsearch does this to speed up its cardinality aggregation: if you're counting the number of unique values of a field, once you've seen a value, there is no need to see the same value again. This is much faster than fully evaluating the query if the field has a relatively low cardinality.

Tencent has done some faceting optimizations using Lucene by taking advantage of index sorting: https://dl.acm.org/doi/abs/10.14778/3554821.3554837. In their case they're trying to optimize histogram facets, but I think that your problem could take advantage of index sorting as well. For instance, imagine if your index was sorted by category then brand. You could then statically map (category, brand) pairs to (per-segment) ranges of doc IDs that have these values.
Then you can check whether (category, brand) pairs co-exist within the matches of a query only by checking if the query has matches in the corresponding ranges of doc IDs. Again, this is a good candidate for dynamic filtering since you only need to evaluate one doc ID per range.

You may not be able to handle all your facet fields this way, but I could imagine a hybrid solution where the main dimensions used for faceting are used to configure the index sort, and other facet fields use a different solution like the above paragraph or your proposal.
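
To make the two ideas concrete, here is a toy sketch in Python. It is not Lucene code: sorted lists of doc IDs stand in for postings and query iterators, and every name (`advance`, `categories_with_hits`, `pairs_with_hits`) is made up for illustration. The first function shrinks the category disjunction as hits are seen; the second assumes an index sorted by category then brand, so that each pair owns one contiguous doc ID range.

```python
from bisect import bisect_left


def advance(docs, target):
    """Return the first doc ID >= target in a sorted list, or None if exhausted."""
    i = bisect_left(docs, target)
    return docs[i] if i < len(docs) else None


def categories_with_hits(query_docs, category_postings):
    """Dynamic filtering: intersect the query with a disjunction over the
    categories not seen yet, dropping a category as soon as it gets one hit.

    query_docs: sorted doc IDs matching the query (hypothetical input).
    category_postings: dict of category -> sorted doc IDs (hypothetical input).
    """
    remaining = dict(category_postings)  # clauses still in the disjunction
    found = set()
    target = 0
    while remaining:
        doc = advance(query_docs, target)
        if doc is None:
            break  # query exhausted
        # Where does each remaining category clause land at or after `doc`?
        live = {}
        for cat, postings in list(remaining.items()):
            nxt = advance(postings, doc)
            if nxt is None:
                del remaining[cat]  # clause exhausted, can never match again
            else:
                live[cat] = nxt
        if not live:
            break
        disjunction_doc = min(live.values())
        if disjunction_doc == doc:
            # Hit: record the matching categories and remove their clauses.
            for cat, nxt in live.items():
                if nxt == doc:
                    found.add(cat)
                    del remaining[cat]
            target = doc + 1
        else:
            # No remaining category matches before disjunction_doc: skip ahead.
            target = disjunction_doc
    return found


def pairs_with_hits(query_docs, pair_ranges):
    """Index-sort variant: each (category, brand) pair maps to a half-open
    [start, end) doc ID range, so one advance per range is enough."""
    found = set()
    for pair, (start, end) in pair_ranges.items():
        doc = advance(query_docs, start)
        if doc is not None and doc < end:
            found.add(pair)
    return found
```

In both functions the work is bounded by the number of distinct values (or ranges) and the skips between them, not by the total number of query matches, which is where the speedup on low-cardinality fields would come from. A real implementation would do the skipping on Lucene's iterators per segment rather than on in-memory lists.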