rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1452764003

   I think it's actually quite easy to fix the stemmer if we want to just send 
them a pull request. I can help if you don't want to do it; I don't want to 
steal your thunder though :)
   
   Thanks again for bringing this to our attention. It really looks like a 
serious issue. I downloaded a dump of Romanian Wikipedia (100K+ articles) and 
generated some stats with 
https://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&id=UnicodeCharacterCount:
   
   ```
   proper comma forms
           U+0218  Ș       129164
           U+0219  ș       1602600
           U+021A  Ț       21578
           U+021B  ț       1088506
   
   old cedilla forms
           U+015E  Ş       1007
           U+015F  ş       34008
           U+0162  Ţ       465
           U+0163  ţ       52129
   ```
   
   Currently the "old cedilla forms" are the only ones that work correctly with 
the Lucene stemmer/stopwords, even though they make up only about 3% of these 
characters in the dump :(
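
   One way to make both forms behave the same is to fold them together before 
stemming and stopword matching. Below is a minimal Python sketch, assuming a 
plain one-to-one mapping from the four cedilla code points listed above to 
their comma-below equivalents. The names `CEDILLA_TO_COMMA` and 
`normalize_romanian` are hypothetical and are not Lucene or Snowball APIs; the 
actual fix in either project may look different.

```python
# Hypothetical helper for illustration only; not part of Lucene or Snowball.
# Maps the legacy cedilla code points to the comma-below forms.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # Ş -> Ș
    "\u015F": "\u0219",  # ş -> ș
    "\u0162": "\u021A",  # Ţ -> Ț
    "\u0163": "\u021B",  # ţ -> ț
})


def normalize_romanian(text: str) -> str:
    """Return text with cedilla s/t replaced by comma-below s/t."""
    return text.translate(CEDILLA_TO_COMMA)
```

   With a step like this in front of the stemmer, a stemmer and stopword list 
written against a single form would see the same input no matter which 
encoding the document used.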


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

