uschindler commented on PR #15900:
URL: https://github.com/apache/lucene/pull/15900#issuecomment-4175882033
Here is the updated variant:
```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.KeywordAttribute;

public final class TruncateTokenFilter extends TokenFilter {

  private final CharTermAttribute termAttribute = addAttribute(CharTermAttribute.class);
  private final KeywordAttribute keywordAttr = addAttribute(KeywordAttribute.class);

  // maximum term length, counted in Unicode code points (not UTF-16 chars)
  private final int codePointLength;

  public TruncateTokenFilter(TokenStream input, int length) {
    super(input);
    if (length < 1) {
      throw new IllegalArgumentException("length parameter must be a positive number: " + length);
    }
    this.codePointLength = length;
  }

  @Override
  public final boolean incrementToken() throws IOException {
    if (input.incrementToken()) {
      // keyword tokens are passed through unchanged
      if (keywordAttr.isKeyword()) {
        return true;
      }
      if (termAttribute.length() <= codePointLength) {
        // the term is short enough in UTF-16 chars, so we do not need to modify it
        return true;
      }
      try {
        // we must count code points to be sure
        int truncateAtChar =
            Character.offsetByCodePoints(
                termAttribute.buffer(), 0, termAttribute.length(), 0, codePointLength);
        termAttribute.setLength(truncateAtChar);
      } catch (IndexOutOfBoundsException _) {
        // fewer than codePointLength code points: the term is short enough
      }
      return true;
    } else {
      return false;
    }
  }
}
```
This code catches the `IndexOutOfBoundsException` because that is cheaper than
validating the arguments first (for example, a term consisting only of
surrogate pairs can have a UTF-16 length larger than the maximum length while
its code point count is not).
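
To illustrate the surrogate case without a Lucene dependency, here is a
minimal standalone sketch of the same counting logic. The class
`TruncateByCodePointsSketch` and its `truncate` helper are hypothetical names
made up for this example and are not part of the PR; the helper works on a
plain `String` instead of a `CharTermAttribute`, and it uses a named catch
parameter so that it also compiles on Java versions without unnamed variables.

```java
public final class TruncateByCodePointsSketch {

  // Hypothetical helper (not Lucene API): keeps at most maxCodePoints code points.
  public static String truncate(String term, int maxCodePoints) {
    char[] buffer = term.toCharArray();
    int length = buffer.length;
    if (length <= maxCodePoints) {
      // short enough in UTF-16 chars, so also short enough in code points
      return term;
    }
    try {
      // find the char index that follows the first maxCodePoints code points
      int truncateAtChar = Character.offsetByCodePoints(buffer, 0, length, 0, maxCodePoints);
      return new String(buffer, 0, truncateAtChar);
    } catch (IndexOutOfBoundsException e) {
      // the buffer holds fewer than maxCodePoints code points: nothing to cut
      return term;
    }
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600: one code point, two UTF-16 chars
    String twoEmoji = emoji + emoji; // 4 chars, 2 code points

    // plain BMP text is cut at the char index
    System.out.println(truncate("abcdef", 3));
    // 4 chars > 3, but only 2 code points: the exception path leaves it unchanged
    System.out.println(truncate(twoEmoji, 3).equals(twoEmoji));
    // a surrogate pair is never split in the middle
    System.out.println(truncate("a" + emoji + "b", 2).equals("a" + emoji));
  }
}
```

The second call in `main` is the case described above: the UTF-16 length
exceeds the limit, so the cheap length check does not return early, and
`Character.offsetByCodePoints` throws because the term has fewer code points
than requested.
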
It passes all tests I created, including the random one. So let's decide how
to proceed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]