Re: [PR] Fix CJKBigramFilter inconsistent positions with outputUnigrams disabled [lucene]

via GitHub Sun, 15 Mar 2026 20:34:39 -0700


herley-shaori commented on code in PR #15825:
URL: https://github.com/apache/lucene/pull/15825#discussion_r2937973100



##########
lucene/analysis/common/src/java/org/apache/lucene/analysis/cjk/CJKBigramFilter.java:
##########
@@ -333,6 +360,10 @@ private void flushUnigram() {
     termAtt.setLength(len);
     offsetAtt.setOffset(startOffset[index], endOffset[index]);
     typeAtt.setType(SINGLE_TYPE);
+    if (!outputUnigrams && deferredPosInc > 0) {

Review Comment:
   Thanks for the review! Applied your suggestion and extended the same
   reasoning to the other guards:

     - flushUnigram(): removed !outputUnigrams && (your suggestion)
     - flushBigram(): added an if (deferredPosInc > 0) guard to skip the
       redundant setPositionIncrement(1) call, since clearAttributes()
       already resets the position increment to 1
     - incrementToken() (both segment boundary checks): removed
       !outputUnigrams && before hadBigrams. Since hadBigrams is only ever
       set to true inside the !outputUnigrams branch of flushBigram(), the
       outer check is redundant.

   Also fixed TestCJKAnalyzer (testJa2, testMix, testMix2,
   testReusableTokenStream, testFinalOffset): the same position increment
   updates were needed, since CJKAnalyzer uses CJKBigramFilter with
   outputUnigrams=false.
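
   For reference, the two redundancy arguments above can be checked with a
   small stand-alone model. This is only a sketch and not the filter's code:
   GuardSketch, PosInc, emitUnguarded and emitGuarded are hypothetical names,
   and the "unguarded" variant is an assumption about what the call looked
   like before the guard was added. The only fact borrowed from Lucene is
   that clearing the attributes resets the position increment to 1.

```java
/** Hypothetical stand-alone model; none of these names are Lucene classes. */
final class GuardSketch {

  /** Minimal stand-in for a position increment attribute whose clear() resets to 1. */
  static final class PosInc {
    private int value = 1;

    void clear() {
      value = 1;
    }

    void set(int v) {
      value = v;
    }

    int get() {
      return value;
    }
  }

  /** Assumed "before" shape: always writes, falling back to 1 when nothing is deferred. */
  static int emitUnguarded(int deferredPosInc) {
    PosInc att = new PosInc();
    att.clear();
    att.set(deferredPosInc > 0 ? deferredPosInc : 1);
    return att.get();
  }

  /** "After" shape: writes only when a deferred increment exists; clear() covers the rest. */
  static int emitGuarded(int deferredPosInc) {
    PosInc att = new PosInc();
    att.clear();
    if (deferredPosInc > 0) {
      att.set(deferredPosInc);
    }
    return att.get();
  }

  /**
   * Enumerates every state reachable under the invariant "hadBigrams is set to true
   * only in the !outputUnigrams branch" and checks that the old and new guards agree.
   */
  static boolean guardsAgreeOnReachableStates() {
    for (boolean outputUnigrams : new boolean[] {false, true}) {
      for (boolean bigramFlushed : new boolean[] {false, true}) {
        boolean hadBigrams = false;
        if (bigramFlushed && !outputUnigrams) {
          hadBigrams = true; // the only place the flag is ever raised
        }
        boolean oldGuard = !outputUnigrams && hadBigrams;
        boolean newGuard = hadBigrams;
        if (oldGuard != newGuard) {
          return false;
        }
      }
    }
    return true;
  }

  public static void main(String[] args) {
    System.out.println("guards agree: " + guardsAgreeOnReachableStates());
    for (int deferred = 0; deferred <= 3; deferred++) {
      System.out.println(
          "deferred=" + deferred
              + " unguarded=" + emitUnguarded(deferred)
              + " guarded=" + emitGuarded(deferred));
    }
  }
}
```

   The model only covers the boolean and default-value reasoning; it says
   nothing about how deferredPosInc is accumulated in the real filter.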



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


