[GitHub] [lucene] twosom closed pull request #12044: some cleanup and refactoring codes of analysis-nori & analysis-kuromoji

2023-02-18 Thread via GitHub


twosom closed pull request #12044: some cleanup and refactoring codes of 
analysis-nori & analysis-kuromoji
URL: https://github.com/apache/lucene/pull/12044


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] twosom closed pull request #12040: Minor refactoring and cleanup to BooleanQuery code

2023-02-18 Thread via GitHub


twosom closed pull request #12040: Minor refactoring and cleanup to 
BooleanQuery code
URL: https://github.com/apache/lucene/pull/12040


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request, #12155: Bring "max ord" check optimization in SortedSetDocValuesSetQuery over…

2023-02-18 Thread via GitHub


gsmiller opened a new pull request, #12155:
URL: https://github.com/apache/lucene/pull/12155

   … to DocValuesRewriteMethod
   
   ### Description
   
   This brings over the recent "max ord" optimization in 
`SortedSetDocValuesSetQuery` (#12129) to `DocValuesRewriteMethod`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] gsmiller opened a new pull request, #12156: Remove custom TermInSetQuery implementation in favor of extending MultiTermQuery

2023-02-18 Thread via GitHub


gsmiller opened a new pull request, #12156:
URL: https://github.com/apache/lucene/pull/12156

   ## Description
   
   `TermInSetQuery`'s implementation is more-or-less an exactly copy of 
`MultiTermQuery` +  `MultiTermQueryConstantScoreWrapper`. This PR removes the 
custom `TermInSetQuery` implementation in favor of extending `MultiTermQuery` 
with `MultiTermQueryConstantScoreWrapper` as the default rewrite behavior.
   
   One nice benefit of this (beyond code cleanup) is that different rewrite 
methods can be provided for different behavior. Specifically, we can leverage 
`DocValuesRewriteMethod` to completely replace `SortedSetDocValuesSetQuery` (I 
would propose doing that in a follow up PR once we're happy with this change).
   
   A couple notes about the change:
   1. I needed to give `MultiTermQueryConstantScoreWrapper` a `ScoreSupplier` 
implementation so `TermInSetQuery` can still be used efficiently with 
`IndexOrDocValuesQuery`. This also required a new (optional) public API in 
`MultiTermQuery` where queries can expose their number of terms (where 
applicable).
   2. Retaining the existing `TermInSetQuery#rewrite` behavior was slightly 
non-trivial. I don't really like how I've done this in the PR, but I also can't 
think of a better solution. I'm not actually convinced we need to retain this, 
but I'm not sure if client code might rely on the existing behavior in places. 
Even if that's the case, it might not be a good enough argument to keep it, but 
I did find a breaking unit test if I didn't keep it. 
(`TestPresearcherMatchCollector.testMatchCollectorShowMatches` assumes the 
query will get rewritten to a BQ). I'm not familiar enough with the `monitor` 
package to assess if it would be reasonable to just change the unit test, or if 
there's something more important going on here. If someone is more familiar 
and/or has opinions on the need to retain the existing "rewrite" behavior, I'd 
love some feedback.
   
   ## Performance
   When this has been discussed in the past, there's been an open question 
around performance since the term intersection happens a bit differently 
(`seekCeil` vs. `seekExact`). I ran some benchmarks using a one-off tool 
(similar to #12151) and found no noticeable regressions/issues. Here's the 
output of my tool (which you can check out here: 
   
[TiSBench.java.txt](https://github.com/apache/lucene/files/10775443/TiSBench.java.txt)
 )
   
   ### All Country Code Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 206.12   | 203.00| 201.95   | 
119.97|
   | TiS MTQ  | 193.67   | 190.54| 189.43   | 
117.02|
   
   ### Medium Cardinality + High Cost Country Code Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 132.42   | 127.95| 126.15   | 
77.80 |
   | TiS MTQ  | 130.49   | 125.97| 124.25   | 
77.31 |
   
   ### Low Cardinality + High Cost Country Code Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 239.00   | 110.95| 36.51| 
75.18 |
   | TiS MTQ  | 242.64   | 113.94| 38.71| 
76.00 |
   
   ### Medium Cardinality + Low Cost Country Code Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 2.51 | 1.99  | 1.67 | 
0.25  |
   | TiS MTQ  | 2.54 | 2.01  | 1.67 | 
0.28  |
   
   ### Low Cardinality + Low Cost Country Code Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 2.34 | 2.07  | 1.88 | 
0.44  |
   | TiS MTQ  | 2.20 | 1.91  | 1.68 | 
0.41  |
   
   ### High Cardinality PK Filter Terms
   | Approach | Large Lead Terms | Medium Lead Terms | Small Lead Terms | 
No Lead Terms |
   
|--|--|---|--|---|
   | TiS Original | 26.41| 26.19 | 25.96| 
6.36  |
   | TiS MTQ  | 27.78| 

[GitHub] [lucene] gsmiller commented on pull request #12055: Better skipping for multi-term queries with a FILTER rewrite.

2023-02-18 Thread via GitHub


gsmiller commented on PR #12055:
URL: https://github.com/apache/lucene/pull/12055#issuecomment-1435802423

   As one more update, I just opened a new PR to have `TermInSetQuery` extend 
`MultiTermQuery` instead of having its own custom implementation (#12156). I 
was able to do some benchmarking and it looks reasonable to me. There are 
probably a couple of rough-edges to sort out on the PR, but that would enable 
us to remove the `TermInSetQuery` changes present in this PR.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org