Trey314159 commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450316155
_Good catch!_ I didn't consider that the stemmer might also be of a vintage to only use the older orthography. I've contacted the Snowball mailing list (message not yet accepted) to suggest making the stemmer aware of both variants of each character.

A character filter or token filter to convert the comma forms to cedilla forms seems like a reasonable hack for now, and I'll add that to our local implementation right away. I have to admit that it chafes a little to convert everything to the "wrong" form, but the internal representation is just an internal representation, I guess, as long as everything is consistent about it.

Feel free to decline this pull request if that makes more sense to you, though I think it might be a good bit of future-proofing for each stage of the process to handle both forms, so as not to be reliant on a counterintuitive character mapping.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
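The interim workaround mentioned in the comment, converting comma forms to cedilla forms before the stemmer sees them, could look roughly like the sketch below. It assumes the characters in question are the Romanian S and T with comma below (U+0218, U+0219, U+021A, U+021B) and maps them to their cedilla counterparts (U+015E, U+015F, U+0162, U+0163). The class and method names are hypothetical and the code is plain Java, not the Lucene analysis API; in an actual analysis chain the same four mappings would more likely be configured on an existing character filter such as `MappingCharFilter`.

```java
import java.util.Map;

// Hypothetical helper, not part of Lucene or of this pull request.
public final class CommaToCedilla {

    // Assumed mapping: precomposed comma-below letters to their cedilla forms.
    private static final Map<Character, Character> COMMA_TO_CEDILLA = Map.of(
        '\u0218', '\u015E', // capital S with comma below -> capital S with cedilla
        '\u0219', '\u015F', // small s with comma below   -> small s with cedilla
        '\u021A', '\u0162', // capital T with comma below -> capital T with cedilla
        '\u021B', '\u0163'  // small t with comma below   -> small t with cedilla
    );

    private CommaToCedilla() {}

    // Returns a copy of the input with every comma-below letter replaced;
    // all other characters pass through unchanged.
    public static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            char mapped = COMMA_TO_CEDILLA.getOrDefault(c, c);
            out.append(mapped);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample word written with comma-below letters, given as escapes.
        String commaForm = "\u0219tiin\u021B\u0103";
        System.out.println(normalize(commaForm));
    }
}
```

The mapping is one-to-one on single UTF-16 code units, so character offsets are preserved. It only covers the precomposed letters; text that spells them as a base letter followed by a combining comma below (U+0326) would need to be composed first, for example with NFC normalization.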