Trey314159 commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450316155

   _Good catch!_ I didn't consider that the stemmer might also be of a vintage 
to only use the older orthography. I've contacted the Snowball mailing list 
(message not yet accepted) to suggest making the stemmer aware of both variants 
of each character. A character filter or token filter to convert the comma 
forms to cedilla forms seems like a reasonable hack for now, and I'll add that 
to our local implementation right away. I have to admit that it chafes a little 
to convert everything to the "wrong" form, but the internal representation is 
just an internal representation, I guess, as long as everything is consistent 
about it.
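
A minimal sketch of the comma-to-cedilla conversion described above, in plain Java with no Lucene dependency. The class and method names are hypothetical and this is not the code from the pull request. In a real analysis chain the same four mappings would more likely be fed to Lucene's `MappingCharFilter` through a `NormalizeCharMap`. The sketch assumes precomposed characters only; decomposed input (a base letter followed by U+0326 COMBINING COMMA BELOW) is not handled.

```java
// Hypothetical helper: maps Romanian comma-below letters to the older
// cedilla forms so that downstream stages see a single, consistent variant.
final class CommaToCedilla {
    private CommaToCedilla() {}

    /** Returns a copy of input with the four comma-below letters replaced. */
    static String normalize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            switch (c) {
                case '\u0218': out.append('\u015E'); break; // capital S: comma below to cedilla
                case '\u0219': out.append('\u015F'); break; // small s: comma below to cedilla
                case '\u021A': out.append('\u0162'); break; // capital T: comma below to cedilla
                case '\u021B': out.append('\u0163'); break; // small t: comma below to cedilla
                default: out.append(c);                     // everything else passes through
            }
        }
        return out.toString();
    }
}
```

Calling `CommaToCedilla.normalize(text)` before stemming would give the stemmer only cedilla forms. The same normalization has to be applied at both index and query time for the two sides to stay consistent.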
   
   Feel free to decline this pull request if that makes more sense to you, 
though I think it might be a good bit of future-proofing for each stage of the 
process to handle both forms, so as not to rely on a counterintuitive 
character mapping.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

