mmatela opened a new issue, #12080:
URL: https://github.com/apache/lucene/issues/12080

   ### Description
   
   In my example, the query is 'test polskie'.
   I use MorfologikFilter for Polish stemming; it turns 'polskie' into 'polski' 
+ 'polskie'.
   I also use SynonymGraphFilter which turns 'polski' into 'pol'. It's applied 
**only for query**.
   Here's what I see in query analysis (token positions in parentheses):
   ```
   Tokenizer: test(1) polskie(2)
   MF: test(1) polskie(2) polski(2)
   SGF: test(1) polskie(2) pol(3) polski(3).
   ```
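
For illustration, the positions above can be reproduced with a small Lucene-free Java sketch. It assumes positions are the running sum of each token's position increment, with an increment of 0 stacking a token on the previous one. `Tok` and `positions` are hypothetical names, and the increments are inferred from the positions shown above, not read from the actual filters.

```java
import java.util.ArrayList;
import java.util.List;

class PositionsFromIncrements {
    // Hypothetical stand-in for a token's term text and position increment.
    static final class Tok {
        final String term;
        final int posInc;

        Tok(String term, int posInc) {
            this.term = term;
            this.posInc = posInc;
        }
    }

    // Formats each token as "term(position)", where the position is the
    // running sum of increments (starting from 0, so the first token is 1).
    static List<String> positions(List<Tok> stream) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        for (Tok t : stream) {
            pos += t.posInc;
            out.add(t.term + "(" + pos + ")");
        }
        return out;
    }

    public static void main(String[] args) {
        // Increments inferred from the MF line: 'polski' is stacked on 'polskie'.
        List<Tok> afterMf = List.of(
            new Tok("test", 1), new Tok("polskie", 1), new Tok("polski", 0));
        // Increments inferred from the SGF line: 'pol' advances by one position.
        List<Tok> afterSgf = List.of(
            new Tok("test", 1), new Tok("polskie", 1),
            new Tok("pol", 1), new Tok("polski", 0));

        System.out.println(positions(afterMf));  // [test(1), polskie(2), polski(2)]
        System.out.println(positions(afterSgf)); // [test(1), polskie(2), pol(3), polski(3)]
    }
}
```

In this model the only difference between the two streams is that 'pol' carries an increment of 1 instead of 0, which pushes it and 'polski' to position 3.
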
   When I search for "test polskie" with quotation marks, a document with the 
same text doesn't match, because SGF shifts the token positions in the query 
relative to those in the index.
   
   In the documentation, the description for the old `SynonymFilter` says "_The 
position value of the new tokens are set such they all occur at the same 
position as the original token._" In `SynonymGraphFilter` they are instead set 
to a position after the previous token. Is that an intentional change? It 
doesn't seem so, because it doesn't work as expected in my example.
   
   Looking at the code, it seems the problem is in 
https://github.com/apache/lucene/blob/main/lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/SynonymGraphFilter.java#L246:
   `nextNodeOut = lastNodeOut + posLenAtt.getPositionLength();`
   
   `nextNodeOut` is always set to the position after the current token, and 
that is later used as the position of the output token.
   I tried removing this line and instead setting the field right after the 
call to `input.incrementToken()`, at line 340:
   `nextNodeOut = lastNodeOut + posIncrAtt.getPositionIncrement();`
   This sets it to the original token's position. This way the final positions 
are `SGF: test(1) polskie(2) pol(2) polski(2).` and my document does match. I 
didn't experience any unexpected side effects.
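
As a rough check of why the two position layouts behave differently, here is a toy phrase matcher in plain Java. It is a simplified model and not how Lucene builds or scores phrase queries: the query is a map from position to the alternative terms at that position, and the phrase matches if a single shift lines up every query position with an indexed position. `PhrasePositionCheck` and its methods are hypothetical names.

```java
import java.util.Map;
import java.util.Set;

class PhrasePositionCheck {
    // True if every query position, shifted by 'shift', hits an indexed
    // position of at least one of its alternative terms.
    static boolean allPositionsMatch(Map<String, Set<Integer>> doc,
                                     Map<Integer, Set<String>> query, int shift) {
        for (Map.Entry<Integer, Set<String>> e : query.entrySet()) {
            int wanted = e.getKey() + shift;
            boolean hit = false;
            for (String term : e.getValue()) {
                if (doc.getOrDefault(term, Set.of()).contains(wanted)) {
                    hit = true;
                    break;
                }
            }
            if (!hit) {
                return false;
            }
        }
        return true;
    }

    // Tries every shift implied by an occurrence of a term from one query position.
    static boolean phraseMatches(Map<String, Set<Integer>> doc,
                                 Map<Integer, Set<String>> query) {
        if (query.isEmpty()) {
            return false;
        }
        Map.Entry<Integer, Set<String>> first = query.entrySet().iterator().next();
        for (String term : first.getValue()) {
            for (int p : doc.getOrDefault(term, Set.of())) {
                if (allPositionsMatch(doc, query, p - first.getKey())) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Index side (MF only): test(1) polskie(2) polski(2).
        Map<String, Set<Integer>> doc = Map.of(
            "test", Set.of(1), "polskie", Set.of(2), "polski", Set.of(2));

        // Query positions as reported: test(1) polskie(2) pol(3) polski(3).
        Map<Integer, Set<String>> reported = Map.of(
            1, Set.of("test"), 2, Set.of("polskie"), 3, Set.of("pol", "polski"));
        // Query positions after the change: test(1) polskie(2) pol(2) polski(2).
        Map<Integer, Set<String>> patched = Map.of(
            1, Set.of("test"), 2, Set.of("polskie", "pol", "polski"));

        System.out.println(phraseMatches(doc, reported)); // false
        System.out.println(phraseMatches(doc, patched));  // true
    }
}
```

Under this model, the reported layout demands a term at a third position that the indexed document never fills, while the stacked layout only requires positions 1 and 2.
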
   
   Hope this helps. I'm not familiar enough with the project to easily submit 
a proper pull request, with tests and all.
   
   ### Version and environment details
   
   Lucene 9.4

