uschindler commented on PR #15900:
URL: https://github.com/apache/lucene/pull/15900#issuecomment-4175882033

   Here is the updated variant:
   
   ```java
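   import java.io.IOException;
   import org.apache.lucene.analysis.TokenFilter;
   import org.apache.lucene.analysis.TokenStream;
   import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
   import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

   public final class TruncateTokenFilter extends TokenFilter {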
   
     private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
     private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);
   
     private final int codePointLength;
   
     public TruncateTokenFilter(TokenStream input, int length) {
       super(input);
       if (length < 1)
         throw new IllegalArgumentException("length parameter must be a positive number: " + length);
       this.codePointLength = length;
     }
   
     @Override
     public final boolean incrementToken() throws IOException {
       if (input.incrementToken()) {
         if (keywordAttr.isKeyword()) {
           return true;
         }
         if (termAttribute.length() <= codePointLength) {
           // the term is short enough in UTF-16 chars, so we do not need to modify it
           return true;
         }
         try {
           // we must count to be sure
           int truncateAtChar =
               Character.offsetByCodePoints(
                   termAttribute.buffer(), 0, termAttribute.length(), 0, codePointLength);
           termAttribute.setLength(truncateAtChar);
         } catch (IndexOutOfBoundsException _) {
           // the term has fewer than codePointLength code points, so it is short enough
         }
         return true;
       } else {
         return false;
       }
     }
   }
   ```
   
   This code catches the `IndexOutOfBoundsException` because that is cheaper than validating the arguments first. The exception can occur when the term consists only of surrogates: its UTF-16 length is then larger than the maximum length, although its code point count is not.
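
To illustrate the surrogate case, here is a minimal standalone sketch of the same counting logic without any Lucene dependency. The class `CodePointTruncateDemo` and the helper `truncatedLength` are hypothetical names used only for this illustration; they are not part of the PR. The only library call is `Character.offsetByCodePoints(char[], int, int, int, int)` from the JDK.

```java
final class CodePointTruncateDemo {

  // Hypothetical helper (not part of Lucene): returns how many UTF-16 chars
  // to keep so that at most maxCodePoints code points remain.
  static int truncatedLength(char[] buffer, int length, int maxCodePoints) {
    if (length <= maxCodePoints) {
      // Fast path: a term can never have more code points than UTF-16 chars.
      return length;
    }
    try {
      // Walk maxCodePoints code points forward from index 0.
      return Character.offsetByCodePoints(buffer, 0, length, 0, maxCodePoints);
    } catch (IndexOutOfBoundsException ignored) {
      // Fewer than maxCodePoints code points in the buffer: nothing to cut.
      return length;
    }
  }

  public static void main(String[] args) {
    // Three times U+1F600: 6 UTF-16 chars, but only 3 code points.
    char[] smileys = "\uD83D\uDE00\uD83D\uDE00\uD83D\uDE00".toCharArray();
    System.out.println(truncatedLength(smileys, smileys.length, 4)); // 6, exception path
    System.out.println(truncatedLength(smileys, smileys.length, 2)); // 4, two surrogate pairs
    char[] ascii = "abcdef".toCharArray();
    System.out.println(truncatedLength(ascii, ascii.length, 3)); // 3
  }
}
```

With a limit of 4, the three-smiley term fails the cheap `length <= maxCodePoints` check (6 > 4) but holds only 3 code points, so `offsetByCodePoints` throws and the term stays unchanged. That is the situation the `catch` block covers.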
   
   It passes all tests I created although the random one. So let's decide how 
to proceed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

