herley-shaori opened a new pull request, #15825: URL: https://github.com/apache/lucene/pull/15825
## Summary

Fixes #15812

`CJKBigramFilter` produces different token positions for the same input depending on whether `outputUnigrams` is `true` or `false`. This breaks phrase queries when index-time and search-time analyzers use different `outputUnigrams` settings, a common optimization pattern for CJK search.

### Root cause

In `flushBigram()`, when `outputUnigrams=false`, bigrams are emitted with the default `positionIncrement=1`, but a bigram conceptually spans **two** character positions. After a word break (punctuation, whitespace, or non-CJK text), subsequent tokens are assigned positions that are off by 1 compared to the `outputUnigrams=true` case.

Example with input `"一二、三"`:

```
outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
outputUnigrams=false: 一二(pos0) 三(pos1)  ← should be pos2
```

### Fix

Following the principle suggested by @rmuir, that `outputUnigrams=false` should behave **as if unigrams were emitted, then later removed**, this PR tracks whether bigrams were emitted from the current CJK segment and defers an extra position increment (`+1`) to apply to the first token after a segment boundary.

Two new fields in `CJKBigramFilter`:

- `hadBigrams`: set `true` when a bigram is flushed in no-unigram mode
- `deferredPosInc`: accumulated extra position increment, applied at the next segment transition (unaligned offsets, non-CJK token, or end of stream)

The deferred increment is applied in `flushBigram()`, `flushUnigram()`, and the non-CJK passthrough path in `incrementToken()`.

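To illustrate the "as if unigrams were emitted, then later removed" principle, here is a minimal, self-contained sketch of the position arithmetic. It is a toy model and not the code in this patch: the class `BigramPositionSketch`, the method `tokens`, and the `deferIncrement` flag are hypothetical names, and the model assumes that only Han characters form tokens, that any other character is a word break producing no token and no position gap, and that a lone Han character is emitted as a unigram. The input is the `"一二、三"` example from above, written with Unicode escapes so the source stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of bigram positions; not Lucene code. All names here are hypothetical. */
public class BigramPositionSketch {

  /** True if the char is in the Han script (BMP only; surrogate pairs are not handled). */
  static boolean isHan(char c) {
    return Character.UnicodeScript.of(c) == Character.UnicodeScript.HAN;
  }

  /**
   * Returns tokens formatted as "text@position". Non-Han chars are treated as
   * word breaks that emit nothing (an assumption of this model).
   *
   * @param deferIncrement if true, one extra position is owed to the first
   *     token after a run that emitted only bigrams
   */
  static List<String> tokens(String text, boolean outputUnigrams, boolean deferIncrement) {
    List<String> out = new ArrayList<>();
    int pos = -1;      // position of the most recently emitted token
    int deferred = 0;  // extra increment owed to the first token of the next run
    int i = 0;
    while (i < text.length()) {
      if (!isHan(text.charAt(i))) {
        i++;  // word break: skip it
        continue;
      }
      int start = i;
      while (i < text.length() && isHan(text.charAt(i))) {
        i++;
      }
      String run = text.substring(start, i);
      if (outputUnigrams) {
        // Each unigram advances the position; its bigram shares that position.
        for (int k = 0; k < run.length(); k++) {
          pos++;
          out.add(run.charAt(k) + "@" + pos);
          if (k + 1 < run.length()) {
            out.add(run.substring(k, k + 2) + "@" + pos);
          }
        }
      } else if (run.length() == 1) {
        // Lone character: emitted as a unigram, consuming any owed increment.
        pos += 1 + deferred;
        deferred = 0;
        out.add(run + "@" + pos);
      } else {
        // Bigrams only: each sits at the position of its first character.
        for (int k = 0; k + 1 < run.length(); k++) {
          pos += 1 + deferred;
          deferred = 0;
          out.add(run.substring(k, k + 2) + "@" + pos);
        }
        if (deferIncrement) {
          deferred = 1;  // account for the run's last character, which got no token
        }
      }
    }
    return out;
  }

  public static void main(String[] args) {
    // The example input: U+4E00 U+4E8C, U+3001 (ideographic comma), U+4E09
    String input = "\u4e00\u4e8c\u3001\u4e09";
    System.out.println(tokens(input, true, false));   // unigrams and bigrams
    System.out.println(tokens(input, false, false));  // bigrams only, no deferral
    System.out.println(tokens(input, false, true));   // bigrams only, deferred +1
  }
}
```

In this model, the last call places the token after the word break at position 2, matching the `outputUnigrams=true` run, while the call without deferral places it at position 1. The real filter additionally handles non-CJK tokens, offset alignment, and end of stream, none of which the sketch covers.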
### Changes

- **`CJKBigramFilter.java`**: Added position tracking logic across CJK segment boundaries
- **`TestCJKBigramFilter.java`**: Added 3 new test cases reproducing the bug; updated `testHanOnly` expected positions
- **`TestWithCJKBigramFilter.java`** (ICU): Updated expected positions in `testJa2`, `testMix`, `testMix2`, `testReusableTokenStream`, and `testFinalOffset`
- **`CHANGES.txt`**: Added bug fix entry

## Test plan

- [x] All 15 CJKBigramFilter tests pass (including 3 new tests)
- [x] All 12 ICU TestWithCJKBigramFilter tests pass
- [x] Code formatting verified via `./gradlew tidy`
- [x] `testBigramPositionsConsistentAcrossWordBreak`: reproduces the exact scenario from the issue
- [x] `testBigramPositionsMultipleSegments`: verifies positions across multiple CJK segments with breaks
- [x] `testBigramPositionsBeforeNonCJK`: verifies a CJK bigram followed by non-CJK text

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
