magibney opened a new pull request, #12207: URL: https://github.com/apache/lucene/pull/12207
For all but extremely selective queries (which are quite fast anyway), the performance of `PrefixQuery` benefits from directly subclassing `MultiTermQuery`, as opposed to subclassing `AutomatonQuery`. This PR proposes a new approach that initially seeks the exact threshold term (the first beyond the range defined by prefix) for each segment. Once you know the exact threshold term, you can determine a single index (of term BytesRef) that differentiates the threshold term from any term matching the prefix; this is the only index we must inspect for each term in the `FilteredTermsEnum`. This approach doesn't work for `AutomatonQuery` generally; but for `PrefixQuery`, where matching terms are guaranteed to appear in a single contiguous range, it's quite straightforward. This also also removes the need for the arbitrary limitation on prefix length (which was a consequence of recursive parsing of prefixes in Automaton builder). Note: this PR still supports lazily building an Automaton in `visitQuery()`, which at the moment is I think only necessary to support highlighting. (TODO: I suspect that changing the `QueryVisitor` API to be more explicitly term-oriented as opposed to Automaton-oriented may be appropriate, and would remove the need for this accommodation). I also left (for now, at least) the prefix automaton as leveraged by `Intervals`. Because this optimization targets the "phase 1/terms collection" part of `MultiTermQuery`, the benefits are greatest in cases where the number of terms matched by the prefix is large, and where the performance impact of "phase 2" -- running the rewritten query -- doesn't overwhelm the cost of "phase 1". "phase 2" cost is proportionally high (relative to field cardinality) for larger indexes, so `wikimedium10k` shows substantial improvement (~50% faster for single-character/broad prefixes), while `wikimedium10m` shows only ~5-10% improvement for the same tasks. For both 10k and 10m, it appears that performance of extremely selective prefixes decreases slightly (~1% for both cases), probably as a result of the extra initial limit threshold seek. But the performance of the most negatively-impacted queries is quite good to begin with, so the absolute gains for broader prefixes should far outweigh the slight regression for the selective case. It's also worth noting that autocomplete (a common use case for prefix queries) tends to have relatively small indexes, so would expect to see greater proportional gains. Fields with content heavily weighted toward particular shared prefixes should also see significant benefits (broad prefixes benefitting the most, and there's a limit to how broad a prefix can be on a field with a relatively even distribution of prefixes). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org