Re: [PR] TruncateTokenFilter truncates safely when the final char is a surrogate pair [lucene]

via GitHub Thu, 02 Apr 2026 01:29:00 -0700


uschindler commented on PR #15900:
URL: https://github.com/apache/lucene/pull/15900#issuecomment-4175578234


   In addition, this filter is not broken, so I respectfully disagree with this PR. 
It truncates tokens after n UTF-16 chars, so you may get half 
surrogates. That's expected. It is exactly the same as using 
`String#substring(n,m)` in your Java code: the same half surrogates may 
occur there, too. You have to be prepared for that. Changing this filter would break 
its contract, and I also don't like that crazy code to filter empty tokens. 
That's not applicable here.
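
   To illustrate the `substring` comparison, here is a minimal sketch in plain Java. The class and method names are made up for this example and are not Lucene code; it only shows that cutting on char boundaries can leave a lone high surrogate at the end:

```java
final class HalfSurrogateDemo {
    // Hypothetical helper: truncate after n UTF-16 chars, like String#substring.
    static String truncateChars(String s, int n) {
        return s.length() <= n ? s : s.substring(0, n);
    }

    public static void main(String[] args) {
        // "ab" followed by U+1F600: 4 Java chars, but only 3 code points.
        String token = "ab\uD83D\uDE00";
        String cut = truncateChars(token, 3);

        // The cut lands between the two halves of the surrogate pair.
        System.out.println(cut.length());
        System.out.println(Character.isHighSurrogate(cut.charAt(cut.length() - 1)));
    }
}
```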
   
   So I would keep that filter here as is and just mention in the Javadocs that 
it counts Java chars (UTF-16 code units) and truncates on char boundaries, not codepoint boundaries.
   
   We should create the missing `CodepointTruncateFilter`. The code would be 
similar to the `CodepointCountFilter` (its optimizations for sorting out 
tokens that don't need an exact codepoint count can be reused). When a 
token really needs to be truncated after n codepoints, it will be slower than 
the current filter, but for most tokens it will be as fast.
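
   A rough sketch of the truncation logic such a filter could use, written as plain Java without any Lucene classes. The class name, method name, and fast-path check are assumptions for illustration, not existing Lucene code. The fast path relies on the fact that every codepoint needs at least one char, so a buffer of `len` chars can never hold more than `len` codepoints:

```java
final class CodepointTruncateSketch {
    // Hypothetical helper: returns the new length in chars so that the
    // buffer holds at most maxCodepoints codepoints and no pair is split.
    static int truncatedLength(char[] buf, int len, int maxCodepoints) {
        // Fast path: len chars contain at most len codepoints,
        // so short tokens need no codepoint counting at all.
        if (len <= maxCodepoints) {
            return len;
        }
        // Slow path: step forward one codepoint at a time.
        int end = 0;
        for (int i = 0; i < maxCodepoints && end < len; i++) {
            end += Character.charCount(Character.codePointAt(buf, end, len));
        }
        return end;
    }

    public static void main(String[] args) {
        // 'a', 'b', U+1F600 (two chars), 'c', 'd'
        char[] buf = "ab\uD83D\uDE00cd".toCharArray();
        int newLen = truncatedLength(buf, buf.length, 3);
        System.out.println(new String(buf, 0, newLen));
    }
}
```

   In a real filter the same calculation would presumably run against the term buffer and the result would be set as the new term length; that wiring is left out here.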
   
   Should I open a new PR for the filter?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


