Jim Ferenczi created LUCENE-10081:
-------------------------------------

             Summary: KoreanTokenizer should check the max backtrace gap on whitespaces
                 Key: LUCENE-10081
                 URL: https://issues.apache.org/jira/browse/LUCENE-10081
             Project: Lucene - Core
          Issue Type: Bug
            Reporter: Jim Ferenczi


Today the KoreanTokenizer keeps track of the whitespace characters that
precede a known term in order to apply a space penalty factor. This whitespace
is treated as part of the next term, so the backtrace gap limit is not applied
to it. As a result, the position buffer can grow to the length of the longest
run of consecutive whitespace in the input. This is problematic because the
buffer is reused on reset(), so any growth persists across uses. We should
ensure that the max backtrace gap limit is applied consistently to consecutive
whitespace.
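
To make the growth concrete, here is a toy model of the behavior described
above. It is not the Lucene implementation: the class name, the method name,
the MAX_GAP value, and the simplification that every non-whitespace character
ends a known term are all assumptions made for illustration. It only counts
how many positions would be pending at once, with and without a gap check on
whitespace.

```java
public class WhitespaceGapSketch {
    // Hypothetical limit standing in for the tokenizer's max backtrace gap.
    static final int MAX_GAP = 1024;

    /**
     * Returns the peak number of positions held pending while scanning input.
     * Simplification: any non-whitespace character is treated as the end of a
     * known term, which flushes everything pending (a "backtrace").
     *
     * @param capWhitespace if false, whitespace is carried along with the next
     *                      term and never triggers a flush; if true, a flush is
     *                      forced once MAX_GAP positions are pending.
     */
    static int peakPending(String input, boolean capWhitespace) {
        int pending = 0;
        int peak = 0;
        for (int i = 0; i < input.length(); i++) {
            pending++;
            peak = Math.max(peak, pending);
            if (Character.isWhitespace(input.charAt(i))) {
                if (capWhitespace && pending >= MAX_GAP) {
                    pending = 0; // forced flush: the gap limit applies to whitespace too
                }
            } else {
                pending = 0; // term found: flush the whitespace together with the term
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // 5000 spaces followed by a single Hangul syllable.
        String input = new String(new char[5000]).replace('\0', ' ') + "\uAC00";
        System.out.println("uncapped peak: " + peakPending(input, false));
        System.out.println("capped peak:   " + peakPending(input, true));
    }
}
```

In this model the uncapped peak tracks the length of the whitespace run, while
the capped peak never exceeds MAX_GAP, which is the property the issue asks
for.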



