gsmiller opened a new pull request, #12156: URL: https://github.com/apache/lucene/pull/12156
## Description `TermInSetQuery`'s implementation is more-or-less an exactly copy of `MultiTermQuery` + `MultiTermQueryConstantScoreWrapper`. This PR removes the custom `TermInSetQuery` implementation in favor of extending `MultiTermQuery` with `MultiTermQueryConstantScoreWrapper` as the default rewrite behavior. One nice benefit of this (beyond code cleanup) is that different rewrite methods can be provided for different behavior. Specifically, we can leverage `DocValuesRewriteMethod` to completely replace `SortedSetDocValuesSetQuery` (I would propose doing that in a follow up PR once we're happy with this change). A couple notes about the change: 1. I needed to give `MultiTermQueryConstantScoreWrapper` a `ScoreSupplier` implementation so `TermInSetQuery` can still be used efficiently with `IndexOrDocValuesQuery`. This also required a new (optional) public API in `MultiTermQuery` where queries can expose their number of terms (where applicable). 2. Retaining the existing `TermInSetQuery#rewrite` behavior was slightly non-trivial. I don't really like how I've done this in the PR, but I also can't think of a better solution. I'm not actually convinced we need to retain this, but I'm not sure if client code might rely on the existing behavior in places. Even if that's the case, it might not be a good enough argument to keep it, but I did find a breaking unit test if I didn't keep it. (`TestPresearcherMatchCollector.testMatchCollectorShowMatches` assumes the query will get rewritten to a BQ). I'm not familiar enough with the `monitor` package to assess if it would be reasonable to just change the unit test, or if there's something more important going on here. If someone is more familiar and/or has opinions on the need to retain the existing "rewrite" behavior, I'd love some feedback. ## Performance When this has been discussed in the past, there's been an open question around performance since the term intersection happens a bit differently (`seekCeil` vs. `seekExact`). I ran some benchmarks using a one-off tool (similar to #12151) and found no noticeable regressions/issues. Here's the output of my tool (which you can check out here: [TiSBench.java.txt](https://github.com/apache/lucene/files/10775443/TiSBench.java.txt) ) ### All Country Code Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 206.12 | 203.00 | 201.95 | 119.97 | | TiS MTQ | 193.67 | 190.54 | 189.43 | 117.02 | ### Medium Cardinality + High Cost Country Code Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 132.42 | 127.95 | 126.15 | 77.80 | | TiS MTQ | 130.49 | 125.97 | 124.25 | 77.31 | ### Low Cardinality + High Cost Country Code Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 239.00 | 110.95 | 36.51 | 75.18 | | TiS MTQ | 242.64 | 113.94 | 38.71 | 76.00 | ### Medium Cardinality + Low Cost Country Code Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 2.51 | 1.99 | 1.67 | 0.25 | | TiS MTQ | 2.54 | 2.01 | 1.67 | 0.28 | ### Low Cardinality + Low Cost Country Code Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 2.34 | 2.07 | 1.88 | 0.44 | | TiS MTQ | 2.20 | 1.91 | 1.68 | 0.41 | ### High Cardinality PK Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 26.41 | 26.19 | 25.96 | 6.36 | | TiS MTQ | 27.78 | 27.65 | 27.32 | 6.69 | ### Medium Cardinality PK Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 2.19 | 2.13 | 2.15 | 0.49 | | TiS MTQ | 2.33 | 2.28 | 2.25 | 0.50 | ### Low Cardinality PK Filter Terms | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | No Lead Terms | |--------------|------------------|-------------------|------------------|---------------| | TiS Original | 1.76 | 1.74 | 1.74 | 0.34 | | TiS MTQ | 1.32 | 1.29 | 1.28 | 0.22 | -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org