Trey314159 commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1454132622
I looked into fixing the stemmer, and it got a little complicated, so if you want to have a go at it, please do. Otherwise we can try opening an issue on their repo. One problem I ran into is that they are building an ISO 8859-2 version of the stemmer, and the comma-based characters are not available in that encoding. I'm not sure what the right path is: dropping ISO 8859-2 support, forking the stemmer into ISO 8859-2 and Unicode versions, or something else entirely.

As for your stats, that jibes with what I found: ~350x more _tokens_ of "și" ("and") with comma than "şi" with cedilla, and ~50x more _articles_ with the comma variant. The tide has definitely shifted in favor of the comma variant, at least on Romanian Wikipedia.

BTW, if anyone needs example words to test the stemmer, here is a pair for each of s and t, where the comma/cedilla character needs to be recognized properly to be stemmed properly: prelungeşte/prelungește and apreciaţi/apreciați.
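For anyone who wants to poke at the test words outside the stemmer, here is a minimal Python sketch of the two points above. It assumes the goal is to fold the cedilla letters into their comma-below counterparts before stemming, which is only one possible approach and not necessarily what the stemmer or this PR ends up doing. The names `CEDILLA_TO_COMMA`, `normalize_romanian`, and `fits_latin2` are made up for illustration.

```python
# Hypothetical helper names; this is an illustration, not Lucene or Snowball code.

# Map the legacy cedilla letters to the comma-below letters.
CEDILLA_TO_COMMA = str.maketrans({
    "\u015E": "\u0218",  # S with cedilla (upper) -> S with comma below
    "\u015F": "\u0219",  # s with cedilla (lower) -> s with comma below
    "\u0162": "\u021A",  # T with cedilla (upper) -> T with comma below
    "\u0163": "\u021B",  # t with cedilla (lower) -> t with comma below
})


def normalize_romanian(text: str) -> str:
    """Replace cedilla s/t with comma-below s/t; leave everything else alone."""
    return text.translate(CEDILLA_TO_COMMA)


def fits_latin2(text: str) -> bool:
    """Return True if every character of text can be encoded in ISO 8859-2."""
    try:
        text.encode("iso8859_2")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # First word of each pair uses the cedilla letter, second the comma letter.
    pairs = [
        ("prelunge\u015Fte", "prelunge\u0219te"),
        ("aprecia\u0163i", "aprecia\u021Bi"),
    ]
    for cedilla_form, comma_form in pairs:
        print(normalize_romanian(cedilla_form) == comma_form,
              fits_latin2(cedilla_form), fits_latin2(comma_form))
```

The `fits_latin2` check is there to show the encoding problem: the cedilla forms encode in ISO 8859-2, while the comma forms raise an encoding error, which is why a single ISO 8859-2 build cannot see the comma variants at all.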