jpountz commented on issue #15017:
URL: https://github.com/apache/lucene/issues/15017#issuecomment-3148203349

   > Since facet counting is a relatively light-weight operation
   
   This statement surprised me a bit, since faceting tasks in the nightly 
benchmarks run several times slower than top-k hits tasks (e.g. 
[AndHighHighDayTaxoFacets](https://benchmarks.mikemccandless.com/AndHighHighDayTaxoFacets.html)
 vs. [AndHighHigh](https://benchmarks.mikemccandless.com/AndHighHigh.html)). 
It's too bad these tasks don't use the exact same queries, but I suspect that 
if you removed facets from the faceting tasks and only computed top hits, 
the tasks would run much faster. 
   
   > Is there any prior work in this space within Lucene or search engines in 
general that anyone is aware of? I haven't seen anything myself, but maybe 
there's something else to draw on for inspiration?
   
   The problem of finding values of a (low-cardinality) field that intersect 
with the query is a good candidate for dynamic filtering. For instance, say you 
want to find all categories that have at least one hit for the user's query. 
You could start with a collector that returns a disjunction over all 
categories, and that removes a category from the disjunction once it has 
visited a hit of that category. Elasticsearch does this to speed up its 
cardinality aggregation: 
if you're counting the number of unique values of a field, once you've seen a 
value, there is no need to see the same value again. This is much faster than 
fully evaluating the query if the field has a relatively low cardinality.
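
   To make the idea concrete, here is a minimal sketch in plain Java rather than against Lucene's APIs. The class name, the method name, and the modelling of postings and query hits as sorted `int[]` arrays are all assumptions for illustration. The loop intersects the query's hits with a disjunction over the categories not yet seen, and shrinks that disjunction each time a category gets its first hit.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical illustration; not Lucene or Elasticsearch code.
public class DynamicFilterSketch {

    /**
     * Returns the categories that have at least one hit.
     *
     * @param sortedHits doc IDs matching the user's query, in increasing order
     * @param postingsByCategory for each category ordinal, the sorted doc IDs in that category
     */
    static SortedSet<Integer> categoriesWithHits(int[] sortedHits, int[][] postingsByCategory) {
        SortedSet<Integer> found = new TreeSet<>();
        Set<Integer> remaining = new HashSet<>();
        for (int c = 0; c < postingsByCategory.length; c++) {
            remaining.add(c);
        }
        int[] cursor = new int[postingsByCategory.length];
        int target = 0;

        while (!remaining.isEmpty()) {
            // Disjunction: smallest doc ID >= target among categories not seen yet.
            int candidate = Integer.MAX_VALUE;
            int candidateCategory = -1;
            Iterator<Integer> it = remaining.iterator();
            while (it.hasNext()) {
                int c = it.next();
                int[] postings = postingsByCategory[c];
                while (cursor[c] < postings.length && postings[cursor[c]] < target) {
                    cursor[c]++;
                }
                if (cursor[c] == postings.length) {
                    it.remove(); // exhausted: this category cannot match anymore
                } else if (postings[cursor[c]] < candidate) {
                    candidate = postings[cursor[c]];
                    candidateCategory = c;
                }
            }
            if (candidateCategory < 0) {
                break;
            }

            // Conjunction with the query: is the candidate doc a hit?
            int idx = Arrays.binarySearch(sortedHits, candidate);
            if (idx >= 0) {
                found.add(candidateCategory);
                // Dynamic filtering: docs of this category no longer need to be visited.
                remaining.remove(candidateCategory);
                target = candidate; // stay here in case the doc has other categories
            } else {
                int next = -idx - 1; // index of the first hit after the candidate
                if (next == sortedHits.length) {
                    break; // the query has no more hits
                }
                target = sortedHits[next];
            }
        }
        return found;
    }
}
```

   With a low-cardinality field, most categories are removed after only a few hits, so the loop can stop long before the query's hits are exhausted.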
   
   Tencent has done some faceting optimizations using Lucene by taking 
advantage of index sorting: 
https://dl.acm.org/doi/abs/10.14778/3554821.3554837. In their case they're 
trying to optimize histogram facets, but I think that your problem could take 
advantage of index sorting as well. For instance, imagine if your index was 
sorted by category then brand. You could then statically map (category, brand) 
pairs to (per-segment) ranges of doc IDs that have these values. Then you can 
check whether (category, brand) pairs co-exist within the matches of a query 
only by checking if the query has matches in the corresponding ranges of doc 
IDs. Again, this is a good candidate for dynamic filtering since you only need 
to evaluate one doc ID per range. You may not be able to handle all your facet 
fields this way, but I could imagine a hybrid solution where the main 
dimensions used for faceting are used to configure the index sort, and other 
facet fields use a different solution like the above paragraph 
or your proposal.
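
   The index-sorting idea can be sketched the same way, again in plain Java and under stated assumptions: the class, its `PairRange` helper, and the `String[]{category, brand}` document model are hypothetical, and a binary search over a sorted array of hits stands in for advancing the query's iterator to the start of each range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration; not Lucene code.
public class SortedRangeFacetSketch {

    /** A (category, brand) pair and the doc ID range [start, end) that holds it. */
    static final class PairRange {
        final String category;
        final String brand;
        final int start;
        final int end;

        PairRange(String category, String brand, int start, int end) {
            this.category = category;
            this.brand = brand;
            this.start = start;
            this.end = end;
        }
    }

    /**
     * Builds the static pair-to-range mapping for one segment.
     * Assumes sortedDocs[docId] = {category, brand} and that docs are
     * already sorted by category, then brand.
     */
    static List<PairRange> buildRanges(String[][] sortedDocs) {
        List<PairRange> ranges = new ArrayList<>();
        int start = 0;
        for (int doc = 1; doc <= sortedDocs.length; doc++) {
            boolean boundary = doc == sortedDocs.length
                    || !Arrays.equals(sortedDocs[doc], sortedDocs[start]);
            if (boundary) {
                ranges.add(new PairRange(sortedDocs[start][0], sortedDocs[start][1], start, doc));
                start = doc;
            }
        }
        return ranges;
    }

    /** Returns "category/brand" for every pair whose doc ID range contains a hit. */
    static List<String> pairsWithHits(int[] sortedHits, List<PairRange> ranges) {
        List<String> result = new ArrayList<>();
        for (PairRange range : ranges) {
            // Find the first hit >= range.start; one probe per range is enough.
            int idx = Arrays.binarySearch(sortedHits, range.start);
            if (idx < 0) {
                idx = -idx - 1;
            }
            if (idx < sortedHits.length && sortedHits[idx] < range.end) {
                result.add(range.category + "/" + range.brand);
            }
        }
        return result;
    }
}
```

   The cost here depends on the number of (category, brand) ranges rather than on the number of hits, which is why only one doc ID per range needs to be evaluated.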


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

