uschindler commented on PR #15900: URL: https://github.com/apache/lucene/pull/15900#issuecomment-4175578234
In addition, this filter is not broken, so I respectfully disagree with this PR. It truncates tokens after n UTF-16 chars, so you may get half of a surrogate pair. That's expected. It is exactly the same as using `String#substring(n,m)` in your Java code: the same half surrogates can occur there, too, and you have to be prepared for that. Changing this filter would break its contract, and I also don't like that crazy code to filter empty tokens; that's not applicable here. So I would keep the filter as is and just mention in the Javadocs that it counts Java chars (UTF-16 code units) and truncates on char, not codepoint, boundaries.

We should create the missing `CodepointTruncateFilter`. The code would be similar to the `CodePointCountFilter` (its optimizations for sorting out tokens that don't need their codepoints counted exactly can be reused). When a token really needs to be truncated after n codepoints, it will be slower than the current filter, but for most tokens it will be just as fast. Should I open a new PR for the filter?
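To illustrate the difference between char-based and codepoint-based truncation, here is a minimal sketch on plain `String`s. It is not Lucene code and not the proposed filter: the class name `TruncateDemo` and both helper methods are hypothetical, and only standard `java.lang.String` and `java.lang.Character` calls are used. The length check in `truncateByCodePoints` is an assumed stand-in for the kind of shortcut mentioned above, relying on the fact that a string never has more codepoints than chars.

```java
public class TruncateDemo {

    // Truncates after maxChars UTF-16 code units, as String#substring does.
    // May cut a surrogate pair in half.
    public static String truncateByChars(String s, int maxChars) {
        return s.length() <= maxChars ? s : s.substring(0, maxChars);
    }

    // Truncates after maxCodePoints Unicode codepoints and never splits a surrogate pair.
    public static String truncateByCodePoints(String s, int maxCodePoints) {
        // Fast path: a string has at most length() codepoints, so short strings need no counting.
        if (s.length() <= maxCodePoints) {
            return s;
        }
        // Slow path: count codepoints exactly only when the char length exceeds the limit.
        if (s.codePointCount(0, s.length()) <= maxCodePoints) {
            return s;
        }
        // Index of the char that follows the first maxCodePoints codepoints.
        int end = s.offsetByCodePoints(0, maxCodePoints);
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        // 'a', U+1F600 (encoded as a surrogate pair), 'b': 4 chars, 3 codepoints.
        String token = "a\uD83D\uDE00b";

        String byChars = truncateByChars(token, 2);
        String byCodePoints = truncateByCodePoints(token, 2);

        // The char-based result ends in a lone high surrogate.
        System.out.println(Character.isHighSurrogate(byChars.charAt(byChars.length() - 1)));
        // The codepoint-based result keeps the pair intact: 3 chars, 2 codepoints.
        System.out.println(byCodePoints.length());
    }
}
```

With the sample token, the char-based variant returns `a` followed by an unpaired high surrogate, while the codepoint-based variant returns `a` plus the complete pair.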
