herley-shaori opened a new pull request, #15825:
URL: https://github.com/apache/lucene/pull/15825

   ## Summary
   
   Fixes #15812
   
   `CJKBigramFilter` produces different token positions for the same input 
depending on whether `outputUnigrams` is `true` or `false`. This breaks phrase 
queries when index-time and search-time analyzers use different 
`outputUnigrams` settings — a common optimization pattern for CJK search.
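   
   To illustrate why a position mismatch breaks phrase matching, here is a
   minimal, self-contained sketch. It does not use Lucene; `PhrasePositionModel`
   and `phraseMatches` are hypothetical names, and the letters `A`, `B`, `C`
   stand in for CJK characters. It assumes an exact phrase match requires the
   index positions to have the same relative gaps as the query positions.
   
   ```java
   import java.util.List;
   import java.util.Map;
   import java.util.Set;

   final class PhrasePositionModel {
     // index: term -> positions where the term was indexed.
     // terms / queryPositions: the phrase as produced by the search-time analyzer.
     static boolean phraseMatches(
         Map<String, Set<Integer>> index, List<String> terms, List<Integer> queryPositions) {
       for (int start : index.getOrDefault(terms.get(0), Set.of())) {
         boolean ok = true;
         for (int i = 1; i < terms.size() && ok; i++) {
           // Each later term must sit at the same relative gap as in the query.
           int expected = start + queryPositions.get(i) - queryPositions.get(0);
           ok = index.getOrDefault(terms.get(i), Set.of()).contains(expected);
         }
         if (ok) {
           return true;
         }
       }
       return false;
     }

     public static void main(String[] args) {
       // Indexed with unigrams and bigrams: A@0, AB@0, B@1, then a break, C@2.
       Map<String, Set<Integer>> index =
           Map.of("A", Set.of(0), "AB", Set.of(0), "B", Set.of(1), "C", Set.of(2));
       // Bigram-only query positions before the fix: AB@0, C@1.
       System.out.println(phraseMatches(index, List.of("AB", "C"), List.of(0, 1))); // false
       // Bigram-only query positions after the fix: AB@0, C@2.
       System.out.println(phraseMatches(index, List.of("AB", "C"), List.of(0, 2))); // true
     }
   }
   ```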
   
   ### Root cause
   
   In `flushBigram()`, when `outputUnigrams=false`, bigrams are emitted with 
the default `positionIncrement=1`, but a bigram conceptually spans **two** 
character positions. After a word break (punctuation, whitespace, or non-CJK 
text), the next token is therefore assigned a position one lower than in the 
`outputUnigrams=true` case.
   
   Example with input `"一二、三"`:
   ```
   outputUnigrams=true:  一(pos0) 一二(pos0) 二(pos1) 三(pos2)
   outputUnigrams=false: 一二(pos0) 三(pos1) ← should be pos2
   ```
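   
   The positions in the example follow from the position increments of the
   emitted tokens. The sketch below is a stand-alone illustration, not Lucene
   code; `PositionsFromIncrements` is a hypothetical name. It assumes the usual
   convention that a consumer starts at position -1 and adds each token's
   increment, and that in `outputUnigrams=true` mode a bigram is stacked on its
   leading unigram with increment 0.
   
   ```java
   import java.util.Arrays;

   final class PositionsFromIncrements {
     // Turns a sequence of position increments into absolute positions.
     static int[] positions(int... increments) {
       int[] out = new int[increments.length];
       int pos = -1; // assumed starting point before the first token
       for (int i = 0; i < increments.length; i++) {
         pos += increments[i];
         out[i] = pos;
       }
       return out;
     }

     public static void main(String[] args) {
       // outputUnigrams=true: unigram(+1), bigram(+0), unigram(+1), unigram(+1)
       System.out.println(Arrays.toString(positions(1, 0, 1, 1))); // [0, 0, 1, 2]
       // outputUnigrams=false, before the fix: bigram(+1), unigram(+1)
       System.out.println(Arrays.toString(positions(1, 1))); // [0, 1]
       // outputUnigrams=false, expected: bigram(+1), unigram(+2)
       System.out.println(Arrays.toString(positions(1, 2))); // [0, 2]
     }
   }
   ```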
   
   ### Fix
   
   Following the principle suggested by @rmuir, that `outputUnigrams=false` 
should behave **as if unigrams were emitted, then later removed**, this PR 
tracks whether the current CJK segment emitted any bigrams and, if so, defers 
one extra position increment (`+1`) to the first token after the segment 
boundary.
   
   Two new fields in `CJKBigramFilter`:
   - `hadBigrams`: set `true` when a bigram is flushed in no-unigram mode
   - `deferredPosInc`: accumulated extra position increment, applied at the 
next segment transition (unaligned offsets, non-CJK token, or end of stream)
   
   The deferred increment is applied in `flushBigram()`, `flushUnigram()`, and 
the non-CJK passthrough path in `incrementToken()`.
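   
   The deferred-increment idea can be modelled outside Lucene as follows. This
   is a simplified sketch under stated assumptions, not the code in this PR:
   `DeferredPosIncModel` and `bigramsOnly` are hypothetical names, the input is
   already split into CJK segments, characters are assumed to be in the BMP,
   and ASCII letters stand in for CJK characters. Only the field names
   `hadBigrams` and `deferredPosInc` are borrowed from the description above.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;

   final class DeferredPosIncModel {
     // Returns "term@position" for each token emitted in bigram-only mode.
     // Each element of segments is one run of adjacent characters; a break
     // (punctuation, whitespace, other text) is assumed between elements.
     static List<String> bigramsOnly(List<String> segments) {
       List<String> out = new ArrayList<>();
       int position = -1; // absolute position of the last emitted token
       int deferredPosInc = 0; // extra increment owed to the next token
       for (String segment : segments) {
         boolean hadBigrams = false;
         if (segment.length() == 1) {
           // A lone character is emitted as a unigram even without unigram output.
           position += 1 + deferredPosInc;
           deferredPosInc = 0;
           out.add(segment + "@" + position);
         }
         for (int i = 0; i + 2 <= segment.length(); i++) {
           position += 1 + deferredPosInc;
           deferredPosInc = 0;
           out.add(segment.substring(i, i + 2) + "@" + position);
           hadBigrams = true;
         }
         if (hadBigrams) {
           // The last bigram hides one trailing unigram position; owe it forward.
           deferredPosInc += 1;
         }
       }
       return out;
     }

     public static void main(String[] args) {
       // "ab" then a break then "c" mirrors the two-character / one-character example.
       System.out.println(bigramsOnly(List.of("ab", "c"))); // [ab@0, c@2]
       System.out.println(bigramsOnly(List.of("abc", "de"))); // [ab@0, bc@1, de@3]
     }
   }
   ```
   
   In this model the resulting positions equal those the same bigrams would
   have if unigrams had been emitted as well, which is the invariant the fix
   aims for.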
   
   ### Changes
   
   - **`CJKBigramFilter.java`**: Added position tracking logic across CJK 
segment boundaries
   - **`TestCJKBigramFilter.java`**: Added 3 new test cases reproducing the 
bug; updated `testHanOnly` expected positions
   - **`TestWithCJKBigramFilter.java`** (ICU): Updated expected positions in 
`testJa2`, `testMix`, `testMix2`, `testReusableTokenStream`, and 
`testFinalOffset`
   - **`CHANGES.txt`**: Added bug fix entry
   
   ## Test plan
   
   - [x] All 15 CJKBigramFilter tests pass (including 3 new tests)
   - [x] All 12 ICU TestWithCJKBigramFilter tests pass
   - [x] Code formatting verified via `./gradlew tidy`
   - [x] `testBigramPositionsConsistentAcrossWordBreak`: reproduces the exact 
scenario from the issue
   - [x] `testBigramPositionsMultipleSegments`: verifies positions across 
multiple CJK segments separated by breaks
   - [x] `testBigramPositionsBeforeNonCJK`: verifies a CJK bigram followed by 
non-CJK text


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

