txwei opened a new issue, #16223:
URL: https://github.com/apache/lucene/issues/16223

   ### Description
   
   **Summary**
   
   Since 10.0, an unanchored MultiTermQuery (a leading wildcard like `*foo*`, 
or a regexp with a leading `.*`) used as a FILTER/MUST clause in a 
BooleanQuery can be dramatically slower than in 9.x, even when the overall 
query matches zero documents.
   
   The cause is that the term-dictionary scan for these queries moved from a 
lazy step (`ScorerSupplier#get()`) to an eager one (`Weight#scorerSupplier()`). 
`scorerSupplier()` is meant to be the cheap "planning" phase, and doing the 
scan there defeats a parent conjunction's ability to short-circuit before the 
scan runs.
   
   
   **Root cause**
   
   `AbstractMultiTermQueryConstantScoreWrapper#scorerSupplier()` now calls 
`collectTerms()` eagerly (to compute an accurate cost and to return null when 
no terms match, so a parent BooleanQuery can short-circuit). For a query with 
an unknown term count (any automaton query — `getTermsCount() == -1`), 
`collectTerms()` walks the field's term dictionary, and a leading wildcard 
can't seek, so it must visit every term. The worst case is a pattern that 
matches few or no terms: it never reaches the 16-term threshold, so it scans 
the entire term dictionary.
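
To illustrate why a pattern with few or no matches is the worst case, here is a minimal self-contained sketch in plain Java with no Lucene dependency. The class `TermScanSketch`, the method `collectUpTo`, and the `TreeSet` standing in for the term dictionary are assumptions made for illustration and are not Lucene's implementation; only the 16-term threshold is taken from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.function.Predicate;

public class TermScanSketch {
    // Threshold mentioned in the issue: collection stops once this many terms match.
    static final int THRESHOLD = 16;

    // Number of terms visited by the most recent scan.
    static int visited;

    // Hypothetical stand-in for collectTerms(): walks the sorted dictionary in order
    // and stops early only when THRESHOLD matching terms have been collected.
    static List<String> collectUpTo(TreeSet<String> dictionary, Predicate<String> matches) {
        visited = 0;
        List<String> collected = new ArrayList<>();
        for (String term : dictionary) {
            visited++;
            if (matches.test(term)) {
                collected.add(term);
                if (collected.size() >= THRESHOLD) {
                    break; // enough terms collected: stop reading the dictionary
                }
            }
        }
        return collected;
    }

    // Builds a synthetic dictionary of the given size (term000000, term000001, ...).
    static TreeSet<String> sampleDictionary(int size) {
        TreeSet<String> dictionary = new TreeSet<>();
        for (int i = 0; i < size; i++) {
            dictionary.add(String.format("term%06d", i));
        }
        return dictionary;
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = sampleDictionary(100_000);

        // Unanchored pattern that matches many terms: stops at the threshold.
        collectUpTo(dictionary, t -> t.contains("term"));
        System.out.println("many matches, visited = " + visited);

        // Unanchored pattern that matches nothing (like *foo*): every term is visited.
        collectUpTo(dictionary, t -> t.contains("foo"));
        System.out.println("no matches, visited = " + visited);
    }
}
```

In this model the first scan stops after 16 terms, while the second visits all 100,000, which mirrors the case where the query matches zero documents yet pays for a full dictionary walk.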
   
   Because BooleanWeight builds a ScorerSupplier for every clause up front, 
this scan runs before the conjunction can discover that a sibling required 
clause matches nothing. In 9.x the scan lived in `get()`, so an empty sibling 
short-circuited the conjunction and the wildcard's `get()` was never called.
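
The difference between the two behaviors can be modeled without Lucene. In the sketch below, `WeightModel`, `ScorerSupplierModel`, `lazyWildcard`, `eagerWildcard`, and `conjunction` are hypothetical names that only mimic the shape described above (a planning step that returns a supplier, and a conjunction that builds every supplier up front); they are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

public class EagerVsLazySketch {
    // Counts how many times the (expensive) term-dictionary scan ran.
    static int scans = 0;

    // Hypothetical stand-in for a ScorerSupplier.
    interface ScorerSupplierModel {
        String get(); // builds the scorer; may be expensive
    }

    // Hypothetical stand-in for a Weight: planning returns a supplier,
    // or null when the clause is known to match nothing.
    interface WeightModel {
        ScorerSupplierModel scorerSupplier();
    }

    static void expensiveScan() {
        scans++;
    }

    // 9.x-like behavior: planning is cheap, the scan happens only in get().
    static WeightModel lazyWildcard() {
        return () -> () -> {
            expensiveScan();
            return "wildcard-scorer";
        };
    }

    // 10.x-like behavior: the scan happens during planning.
    static WeightModel eagerWildcard() {
        return () -> {
            expensiveScan();
            return () -> "wildcard-scorer";
        };
    }

    // A required clause that matches no documents: planning returns null.
    static WeightModel emptyTerm() {
        return () -> null;
    }

    // Models a conjunction: build a supplier for every clause up front, then
    // short-circuit if any required clause has no supplier.
    static String conjunction(List<WeightModel> clauses) {
        List<ScorerSupplierModel> suppliers = new ArrayList<>();
        for (WeightModel clause : clauses) {
            suppliers.add(clause.scorerSupplier());
        }
        for (ScorerSupplierModel supplier : suppliers) {
            if (supplier == null) {
                return null; // short-circuit: no get() is ever called
            }
        }
        StringBuilder scorers = new StringBuilder();
        for (ScorerSupplierModel supplier : suppliers) {
            scorers.append(supplier.get()).append(' ');
        }
        return scorers.toString().trim();
    }

    // Runs wildcard AND empty-term and reports how many scans were triggered.
    static int scansFor(WeightModel wildcard) {
        scans = 0;
        conjunction(List.of(wildcard, emptyTerm()));
        return scans;
    }

    public static void main(String[] args) {
        System.out.println("lazy scans  = " + scansFor(lazyWildcard()));
        System.out.println("eager scans = " + scansFor(eagerWildcard()));
    }
}
```

Under these assumptions the lazy variant triggers no scan for the zero-hit conjunction, while the eager variant always pays for one, because the scan has already run by the time the empty sibling is noticed.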
   
   **Performance analysis**
   From local benchmarking:
   
   | scenario | 9.11.1 | 10.1.0 | change |
   |---|---|---|---|
   | Build the `ScorerSupplier` only (no scorer built yet) | ~0.01 ms | ~41 ms | **~4000× slower** |
   | `FILTER(*foo*) AND FILTER(<term that matches 0 docs>)` → 0 hits | ~0.15 ms | ~53 ms | **~350× slower** |
   | Wildcard scorer actually built (`ScorerSupplier#get()`), nothing skips it | ~40 ms | ~50 ms | ~unchanged |
   
   **Potential fixes (Open for discussion)**
   1. Defer `collectTerms()` to `get()` when the term count is unknown 
(https://github.com/apache/lucene/pull/16222), keeping the eager path only for 
known, bounded term sets. The trade-off is that this reverts some of the cost 
estimation improvements brought by #13201.
   2. A more targeted "is this automaton seekable/anchored?" signal so anchored 
automaton queries keep precise cost and only truly-unanchored ones defer. 
Cleaner in principle, but there's no obvious cheap signal (a non-empty common 
prefix doesn't guarantee a cheap scan).
   3. Bound the eager scan effort and fall back to lazy? This sounds 
overcomplicated.
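
As a rough illustration of option 1, the sketch below defers the scan whenever the term count is unknown (`-1`) and keeps the eager path otherwise. It is a plain-Java model and not the code in the linked pull request: `DeferUnknownSketch`, `SupplierModel`, `plan`, `scanDictionary`, and the placeholder cost values are all hypothetical.

```java
import java.util.function.LongSupplier;

public class DeferUnknownSketch {
    // Counts how many times the (expensive) scan ran.
    static int scans = 0;

    // Hypothetical stand-in for a ScorerSupplier with a cost estimate.
    interface SupplierModel {
        long cost();
        void get();
    }

    // Hypothetical scan: returns a precise cost (placeholder value) and counts the call.
    static long scanDictionary() {
        scans++;
        return 42L;
    }

    // Planning step. termsCount == -1 means "unknown", as for automaton queries.
    static SupplierModel plan(int termsCount, long upperBoundCost, LongSupplier scan) {
        if (termsCount != -1) {
            // Known, bounded term set: scan eagerly to get a precise cost.
            long preciseCost = scan.getAsLong();
            return new SupplierModel() {
                public long cost() { return preciseCost; }
                public void get() { /* terms were already collected during planning */ }
            };
        }
        // Unknown term count: return cheaply with a coarse cost and scan only in get().
        return new SupplierModel() {
            private boolean scanned = false;
            public long cost() { return upperBoundCost; }
            public void get() {
                if (!scanned) {
                    scan.getAsLong();
                    scanned = true;
                }
            }
        };
    }

    public static void main(String[] args) {
        SupplierModel unknown = plan(-1, 1000L, DeferUnknownSketch::scanDictionary);
        System.out.println("after planning (unknown): scans = " + scans + ", cost = " + unknown.cost());
        unknown.get();
        System.out.println("after get() (unknown): scans = " + scans);

        SupplierModel known = plan(3, 1000L, DeferUnknownSketch::scanDictionary);
        System.out.println("after planning (known): scans = " + scans + ", cost = " + known.cost());
    }
}
```

The model also makes the trade-off visible: the deferred path can only report the coarse upper-bound cost during planning, which is the loss of cost precision noted in option 1.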
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

