rmuir commented on issue #11976: URL: https://github.com/apache/lucene/issues/11976#issuecomment-1328150137
I debugged the issue. The problem is not this particular charfilter; it affects all charfilters.

Think about this single-character string: "㋀". Our charfilter turns it into two characters, "1" and "月". We would expect the offsets to look like this:

```
first token "1" at rawStartOffset=0, rawEndOffset=1 -> startOffset=0, endOffset=1
correctOffset(0) -> 0
correctOffset(1) -> 1
second token "月" at rawStartOffset=1, rawEndOffset=2 -> startOffset=0, endOffset=1
correctOffset(1) -> 0
correctOffset(2) -> 1
```

As you can see, the bug is in the charfilter API itself, in `correctOffset`: we need `correctOffset(1) -> 1` for the end offset of the first token, but `correctOffset(1) -> 0` for the start offset of the second token. A single method cannot return both values for the same input.

I can't see any way to fix this without changing the actual charfilter API (e.g. supporting two separate methods: `correctStartOffset()` and `correctEndOffset()`).

Sorry for the bad example/explanation. Another example would be a charfilter that converts `æ` to `ae`. The end offset of `a` (1) needs to remain 1 after correction, but the start offset of `e` (also 1) needs to be corrected to 0.
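To make the `æ` -> `ae` case concrete, here is a minimal, self-contained sketch of what separate start and end corrections could look like. It does not use Lucene's `CharFilter`. The class `OffsetCorrectionSketch` and its `outToIn` table are hypothetical, and `correctStartOffset()` / `correctEndOffset()` are only the method names proposed above, not an existing API.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a charfilter that expands the ae ligature (U+00E6) into "ae".
class OffsetCorrectionSketch {
    // outToIn.get(i) is the index of the input char that produced output char i.
    private final List<Integer> outToIn = new ArrayList<>();
    private final StringBuilder output = new StringBuilder();
    private final int inputLength;

    public OffsetCorrectionSketch(String input) {
        inputLength = input.length();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            String replacement = (c == '\u00E6') ? "ae" : String.valueOf(c);
            for (int j = 0; j < replacement.length(); j++) {
                outToIn.add(i); // every output char remembers its source char
            }
            output.append(replacement);
        }
    }

    public String output() {
        return output.toString();
    }

    // A start offset maps to the beginning of the input char that produced
    // the output char at rawOffset.
    public int correctStartOffset(int rawOffset) {
        return rawOffset < outToIn.size() ? outToIn.get(rawOffset) : inputLength;
    }

    // An end offset maps to the end of the input char that produced
    // the output char just before rawOffset.
    public int correctEndOffset(int rawOffset) {
        return rawOffset > 0 ? outToIn.get(rawOffset - 1) + 1 : 0;
    }

    public static void main(String[] args) {
        OffsetCorrectionSketch f = new OffsetCorrectionSketch("\u00E6");
        // Token "a": raw offsets [0, 1)
        System.out.println(f.correctStartOffset(0) + ".." + f.correctEndOffset(1));
        // Token "e": raw offsets [1, 2)
        System.out.println(f.correctStartOffset(1) + ".." + f.correctEndOffset(2));
    }
}
```

In this sketch the raw offset 1 corrects to 1 when treated as an end offset and to 0 when treated as a start offset, which is exactly the pair of answers a single `correctOffset(1)` cannot give.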