magibney commented on PR #12207: URL: https://github.com/apache/lucene/pull/12207#issuecomment-1478015886
The performance of the approach taken by this proposal comes from the fact that when you know the exact term of the limit threshold, you can determine a single byte index that suffices as the comparison point for every candidate term in the source TermsEnum. So beyond the cost of an extra terms dictionary seek (or two), you're guaranteed to compare exactly one byte per term when filtering.

This proposed implementation is very simple, but for PrefixQuery, simple is appropriate, given that we know this is always going to be a linear scan of terms. The benefit shows up in cases where the cost of terms iteration is relatively large. One such case is "smaller indexes", but the motivating case is actually "longer prefixes matching larger numbers of terms" (e.g., URLs, taxonomies), which is hard to demonstrate with the consistent fanout of the standard benchmarking data.

Not easily reproducible for now (sorry!), but for an index with 33m docs, faceting on a field of cardinality 2.3m, a prefix covering 212k unique values (~10% of terms) is consistently ~30% faster with the new approach than with an automaton-based approach. When the prefix covers 2.1m (~90% of terms, crazy I know, but it happens), the new approach is consistently ~40% faster. (For transparency, I'm set up to easily test this on Lucene 8.8, so that's where these numbers are coming from.)

And as large as these speedups are percentage-wise, the absolute difference matters even more, given that the largest impact is on the slower queries: request latency (new approach vs. automaton-based) is 120ms vs. 180ms at 10% prefix coverage and 670ms vs. 1150ms at 90% prefix coverage.
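To make the "one comparison per term" claim concrete, here is a minimal sketch of how I read the idea. It is not the code from the PR: it stands in a `TreeSet<String>` for the terms dictionary, compares `char`s instead of the bytes of a `BytesRef`, and assumes the last character of the prefix is not `Character.MAX_VALUE`. The class and method names (`PrefixScanSketch`, `scan`) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

// Hypothetical illustration only; a sorted set stands in for a terms dictionary.
final class PrefixScanSketch {

    /** Returns every term that starts with {@code prefix}, in sorted order. */
    static List<String> scan(NavigableSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        if (prefix.isEmpty()) {
            out.addAll(terms); // every term matches the empty prefix
            return out;
        }
        int n = prefix.length();
        // Smallest string that sorts after every string starting with prefix.
        // Assumption: the last char of prefix is not Character.MAX_VALUE.
        String upper = prefix.substring(0, n - 1) + (char) (prefix.charAt(n - 1) + 1);

        // One extra "seek": the exact limit term, i.e. the first term past the range.
        String limit = terms.ceiling(upper);

        int d = -1;    // the single index compared for every candidate
        char stop = 0; // value of the limit term at that index
        if (limit != null) {
            // The limit term sorts after the prefix and does not start with it,
            // so the two differ at some index d < n.
            d = 0;
            while (limit.charAt(d) == prefix.charAt(d)) {
                d++;
            }
            stop = limit.charAt(d);
        }

        // Linear scan from the first term >= prefix. Every term before the limit
        // term starts with prefix, so it holds prefix.charAt(d) at index d; the
        // limit term is the first one holding `stop` there.
        for (String t : terms.tailSet(prefix, true)) {
            if (limit != null && t.charAt(d) == stop) {
                break; // reached the limit term
            }
            out.add(t);
        }
        return out;
    }
}
```

With terms `a, ab, abc, abd, ac, b, bb` and prefix `ab`, the limit term is `ac`, the two first differ at index 1, and the scan stops at the first term whose character at index 1 is `c`. When no limit term exists (the prefix range runs to the end of the dictionary), no per-term comparison is needed at all.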
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org