rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450554452

   > I have to admit that it chafes a little to convert everything to the 
"wrong" form, but the internal representation is just an internal 
representation, I guess, as long as everything is consistent about it.
   
   That's a good point: the representation isn't always internal once we consider 
external libraries and other tools that Lucene can integrate with for language 
processing.
   
   As another data point, I looked at the Romanian Hunspell dictionary (a user 
could substitute HunspellStemFilter for SnowballFilter if they want 
dictionary-based stemming, and I was curious).
   
   It uses `ș` and `ț` (the comma-below forms), but has directives at the end to 
handle the cedilla variants during the suggestion phase:
   ```
   MAP sşș
   MAP tţț
   ```
   
   So I think the ideal approach would be to both fix Snowball and map the 
cedilla forms to the "correct" forms with a TokenFilter. I realize this doesn't 
solve your problem immediately, but if the PR gets accepted to Snowball, I can 
just bump our git revision to the new one and we'll have the new stemmer :)
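
   To make the mapping idea concrete, here is a rough sketch of just the character substitution such a filter would perform. It is plain Java with no Lucene dependency; the class name `RomanianCedillaNormalizer` and its methods are hypothetical and not part of Lucene or of this PR. In a real TokenFilter the same loop would presumably run over the term buffer inside `incrementToken()`.

```java
/**
 * Hypothetical helper (not a Lucene class): rewrites the legacy cedilla
 * letters to the comma-below letters used in standard Romanian text.
 */
public final class RomanianCedillaNormalizer {
    private RomanianCedillaNormalizer() {}

    /** Maps a single char; anything that is not a cedilla form is returned unchanged. */
    static char mapChar(char c) {
        switch (c) {
            case '\u015E': return '\u0218'; // S with cedilla -> S with comma below
            case '\u015F': return '\u0219'; // s with cedilla -> s with comma below
            case '\u0162': return '\u021A'; // T with cedilla -> T with comma below
            case '\u0163': return '\u021B'; // t with cedilla -> t with comma below
            default: return c;
        }
    }

    /** Normalizes buffer[0..length) in place, the way a filter would treat a term buffer. */
    public static void normalize(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = mapChar(buffer[i]);
        }
    }

    /** Convenience wrapper for whole strings. */
    public static String normalize(String s) {
        char[] buf = s.toCharArray();
        normalize(buf, buf.length);
        return new String(buf);
    }
}
```

   Each replacement is a single BMP character, so the term length never changes and the rewrite can happen in place without touching offsets.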


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

