rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1452764003
I think it's actually quite easy to fix the stemmer if we want to just send them a pull request. I can help if you don't want to do it; I don't want to steal your thunder though :)

Thanks again for bringing this to our attention. It really looks like a serious issue. I downloaded a dump of Romanian Wikipedia (100K+ articles) and generated some stats with https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeCharacterCount:

```
proper comma forms
U+0218 Ș 129164
U+0219 ș 1602600
U+021A Ț 21578
U+021B ț 1088506

old cedilla forms
U+015E Ş 1007
U+015F ş 34008
U+0162 Ţ 465
U+0163 ţ 52129
```

Currently the "old cedilla forms" are the only ones working correctly with the Lucene stemmer/stopwords :(
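For anyone who wants to reproduce this kind of count or experiment with a workaround, here is a minimal sketch in Python. It is not Lucene or Snowball code, and it is not necessarily how the PR addresses the problem. The names `count_forms` and `normalize_to_comma` are hypothetical helpers made up for illustration. The sketch assumes the text is already decoded to Unicode and simply maps the four legacy cedilla code points to their comma-below counterparts. Summing the stats above, the comma forms account for 2,841,848 occurrences against 87,609 for the cedilla forms, so roughly 97% of these characters in the dump are in the form the stemmer/stopwords currently miss.

```python
from collections import Counter

# Legacy cedilla code points mapped to the comma-below code points.
# The direction (cedilla -> comma) is an assumption for this sketch.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla (upper) -> S with comma below
    "\u015F": "\u0219",  # s with cedilla (lower) -> s with comma below
    "\u0162": "\u021A",  # T with cedilla (upper) -> T with comma below
    "\u0163": "\u021B",  # t with cedilla (lower) -> t with comma below
})

# The eight code points listed in the stats above.
TRACKED = frozenset("\u0218\u0219\u021A\u021B\u015E\u015F\u0162\u0163")


def count_forms(text):
    """Count occurrences of the tracked comma and cedilla characters."""
    return Counter(ch for ch in text if ch in TRACKED)


def normalize_to_comma(text):
    """Rewrite legacy cedilla forms as comma-below forms; leave the rest as is."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    sample = "\u015Fi \u0163ar\u0103, \u0219i \u021Bar\u0103"
    for ch, n in sorted(count_forms(sample).items()):
        print(f"U+{ord(ch):04X} {ch} {n}")
    print(normalize_to_comma(sample))
```

Running `count_forms` over a dump before and after `normalize_to_comma` gives a quick check that no cedilla forms remain. Whether normalization belongs in the analyzer chain or the stemmer itself should handle both forms is a separate design question that this sketch does not settle.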