rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450554452

> I have to admit that it chafes a little to convert everything to the "wrong" form, but the internal representation is just an internal representation, I guess, as long as everything is consistent about it.

That's a good point: the representation isn't always internal if we consider external libraries and other things that Lucene can integrate with for language processing.

As another data point, I investigated the Romanian hunspell dictionary (a user could substitute HunspellStemFilter for SnowballFilter if they want dictionary-based stemming, and I was curious). They are using `ș` and `ț`, but have directives at the end to handle the differences during the suggestions phase:

```
MAP sşș
MAP tţț
```

So I think the ideal situation would be to both fix snowball and then map the cedilla characters to the "correct" forms with a TokenFilter?

I realize this doesn't solve your problem immediately, but if the PR gets accepted to snowball, I can just bump our git revision to the new one and we'll have the new stemmer :)
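
For illustration only, the cedilla-to-comma-below mapping could look roughly like the sketch below. It is plain Java with no Lucene dependency, and `RomanianCedillaNormalizer` is a hypothetical name that exists in neither Lucene nor snowball. In a real TokenFilter the same per-character logic would presumably run over the term buffer inside `incrementToken()`.

```java
// Hypothetical helper for illustration; not part of Lucene or snowball.
final class RomanianCedillaNormalizer {
    private RomanianCedillaNormalizer() {}

    /** Returns the input with s/t-cedilla replaced by s/t-comma-below; all other characters are unchanged. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            out.append(mapChar(input.charAt(i)));
        }
        return out.toString();
    }

    /** Maps a single cedilla form to its comma-below counterpart. */
    static char mapChar(char c) {
        switch (c) {
            case '\u015E': return '\u0218'; // capital S with cedilla -> capital S with comma below
            case '\u015F': return '\u0219'; // small s with cedilla -> small s with comma below
            case '\u0162': return '\u021A'; // capital T with cedilla -> capital T with comma below
            case '\u0163': return '\u021B'; // small t with cedilla -> small t with comma below
            default: return c;
        }
    }
}
```

All four characters are in the Basic Multilingual Plane, so iterating over `char` values is sufficient here and the output length always equals the input length.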