[GitHub] [lucene] gsmiller opened a new pull request, #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

via GitHub Sat, 18 Feb 2023 16:37:47 -0800


gsmiller opened a new pull request, #12156:
URL: https://github.com/apache/lucene/pull/12156


   ## Description
   
   `TermInSetQuery`'s implementation is more-or-less an exactly copy of 
`MultiTermQuery` +  `MultiTermQueryConstantScoreWrapper`. This PR removes the 
custom `TermInSetQuery` implementation in favor of extending `MultiTermQuery` 
with `MultiTermQueryConstantScoreWrapper` as the default rewrite behavior.
   
   One nice benefit of this (beyond code cleanup) is that different rewrite 
methods can be provided for different behavior. Specifically, we can leverage 
`DocValuesRewriteMethod` to completely replace `SortedSetDocValuesSetQuery` (I 
would propose doing that in a follow up PR once we're happy with this change).
   
   A couple notes about the change:
   1. I needed to give `MultiTermQueryConstantScoreWrapper` a `ScoreSupplier` 
implementation so `TermInSetQuery` can still be used efficiently with 
`IndexOrDocValuesQuery`. This also required a new (optional) public API in 
`MultiTermQuery` where queries can expose their number of terms (where 
applicable).
   2. Retaining the existing `TermInSetQuery#rewrite` behavior was slightly 
non-trivial. I don't really like how I've done this in the PR, but I also can't 
think of a better solution. I'm not actually convinced we need to retain this, 
but I'm not sure if client code might rely on the existing behavior in places. 
Even if that's the case, it might not be a good enough argument to keep it, but 
I did find a breaking unit test if I didn't keep it. 
(`TestPresearcherMatchCollector.testMatchCollectorShowMatches` assumes the 
query will get rewritten to a BQ). I'm not familiar enough with the `monitor` 
package to assess if it would be reasonable to just change the unit test, or if 
there's something more important going on here. If someone is more familiar 
and/or has opinions on the need to retain the existing "rewrite" behavior, I'd 
love some feedback.
   
   ## Performance
   When this has been discussed in the past, there's been an open question 
around performance since the term intersection happens a bit differently 
(`seekCeil` vs. `seekExact`). I ran some benchmarks using a one-off tool 
(similar to #12151) and found no noticeable regressions/issues. Here's the 
output of my tool (which you can check out here: 
   
[TiSBench.java.txt](https://github.com/apache/lucene/files/10775443/TiSBench.java.txt)
 )
   
   ### All Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 206.12           | 203.00            | 201.95           | 
119.97        |
   | TiS MTQ      | 193.67           | 190.54            | 189.43           | 
117.02        |
   
   ### Medium Cardinality + High Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 132.42           | 127.95            | 126.15           | 
77.80         |
   | TiS MTQ      | 130.49           | 125.97            | 124.25           | 
77.31         |
   
   ### Low Cardinality + High Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 239.00           | 110.95            | 36.51            | 
75.18         |
   | TiS MTQ      | 242.64           | 113.94            | 38.71            | 
76.00         |
   
   ### Medium Cardinality + Low Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.51             | 1.99              | 1.67             | 
0.25          |
   | TiS MTQ      | 2.54             | 2.01              | 1.67             | 
0.28          |
   
   ### Low Cardinality + Low Cost Country Code Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.34             | 2.07              | 1.88             | 
0.44          |
   | TiS MTQ      | 2.20             | 1.91              | 1.68             | 
0.41          |
   
   ### High Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 26.41            | 26.19             | 25.96            | 
6.36          |
   | TiS MTQ      | 27.78            | 27.65             | 27.32            | 
6.69          |
   
   ### Medium Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 2.19             | 2.13              | 2.15             | 
0.49          |
   | TiS MTQ      | 2.33             | 2.28              | 2.25             | 
0.50          |
   
   ### Low Cardinality PK Filter Terms
   | Approach     | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--------------|------------------|-------------------|------------------|---------------|
   | TiS Original | 1.76             | 1.74              | 1.74             | 
0.34          |
   | TiS MTQ      | 1.32             | 1.29              | 1.28             | 
0.22          |
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

[GitHub] [lucene] gsmiller opened a new pull request, #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

Reply via email to