[GitHub] [lucene] magibney opened a new pull request, #12207: simplify PrefixQuery to avoid requiring Automaton

via GitHub Thu, 16 Mar 2023 09:58:34 -0700


magibney opened a new pull request, #12207:
URL: https://github.com/apache/lucene/pull/12207


   For all but extremely selective queries (which are quite fast anyway), the 
performance of `PrefixQuery` benefits from directly subclassing 
`MultiTermQuery`, as opposed to subclassing `AutomatonQuery`.
   
   This PR proposes a new approach that initially seeks the exact threshold 
term (the first beyond the range defined by prefix) for each segment. Once you 
know the exact threshold term, you can determine a single index (of term 
BytesRef) that differentiates the threshold term from any term matching the 
prefix; this is the only index we must inspect for each term in the 
`FilteredTermsEnum`.
   
   This approach doesn't work for `AutomatonQuery` generally; but for 
`PrefixQuery`, where matching terms are guaranteed to appear in a single 
contiguous range, it's quite straightforward. This also also removes the need 
for the arbitrary limitation on prefix length (which was a consequence of 
recursive parsing of prefixes in Automaton builder).
   
   Note: this PR still supports lazily building an Automaton in `visitQuery()`, 
which at the moment is I think only necessary to support highlighting. (TODO: I 
suspect that changing the `QueryVisitor` API to be more explicitly 
term-oriented as opposed to Automaton-oriented may be appropriate, and would 
remove the need for this accommodation). I also left (for now, at least) the 
prefix automaton as leveraged by `Intervals`.
   
   Because this optimization targets the "phase 1/terms collection" part of 
`MultiTermQuery`, the benefits are greatest in cases where the number of terms 
matched by the prefix is large, and where the performance impact of "phase 2" 
-- running the rewritten query -- doesn't overwhelm the cost of "phase 1". 
"phase 2" cost is proportionally high (relative to field cardinality) for 
larger indexes, so `wikimedium10k` shows substantial improvement (~50% faster 
for single-character/broad prefixes), while `wikimedium10m` shows only ~5-10% 
improvement for the same tasks.
   
   For both 10k and 10m, it appears that performance of extremely selective 
prefixes decreases slightly (~1% for both cases), probably as a result of the 
extra initial limit threshold seek. But the performance of the most 
negatively-impacted queries is quite good to begin with, so the absolute gains 
for broader prefixes should far outweigh the slight regression for the 
selective case.
   
   It's also worth noting that autocomplete (a common use case for prefix 
queries) tends to have relatively small indexes, so would expect to see greater 
proportional gains.
   
   Fields with content heavily weighted toward particular shared prefixes 
should also see significant benefits (broad prefixes benefitting the most, and 
there's a limit to how broad a prefix can be on a field with a relatively even 
distribution of prefixes).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] magibney opened a new pull request, #12207: simplify PrefixQuery to avoid requiring Automaton

Reply via email to