Trey314159 commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1454132622

   I looked into fixing the stemmer, and it got a little complicated, so if you 
want to have a go at it, please do. Otherwise we can try opening an issue on 
their repo. One problem I ran into is that they are building an ISO 8859-2 
version of the stemmer, and the comma-based characters are not available in 
that encoding. I'm not sure what the right path is: dropping ISO 8859-2 
support, forking the stemmer into ISO 8859-2 and Unicode versions, or 
something else entirely.
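   
   To illustrate the encoding problem, here is a minimal Python sketch (not 
taken from the stemmer's build; the helper name `encodable` and the constant 
names are made up for this example). It relies only on Python's built-in 
`iso8859_2` codec and shows that the cedilla letters fit in ISO 8859-2 while 
the comma-below letters do not:

```python
# Cedilla forms: U+015F, U+0163, U+015E, U+0162
CEDILLA = "\u015f\u0163\u015e\u0162"
# Comma-below forms: U+0219, U+021B, U+0218, U+021A
COMMA = "\u0219\u021b\u0218\u021a"


def encodable(text, encoding="iso8859_2"):
    """Return True if every character of `text` exists in `encoding`."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for ch in CEDILLA + COMMA:
        print(f"U+{ord(ch):04X} {ch} encodable in ISO 8859-2: {encodable(ch)}")
```

   So any ISO 8859-2 build of the stemmer can only ever see the cedilla forms, 
which is why the options above all involve either dropping that build or 
treating it separately.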
   
   As for your stats, that jibes with what I found: ~350x more _tokens_ of "și" 
("and") with comma than "şi" with cedilla, and ~50x more _articles_ with the 
comma variant. The tide has definitely shifted in favor of the comma variant, 
at least on Romanian Wikipedia.
   
   BTW, if anyone needs example words to test the stemmer, here is a pair for 
each of s and t, where the comma/cedilla character needs to be recognized 
properly for the word to be stemmed properly: prelungeşte/prelungește and 
apreciaţi/apreciați.
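   
   One way to make both spellings reach the stemmer in the same form is to 
fold the cedilla letters to their comma-below counterparts first. The sketch 
below is only an illustration of that idea in Python, not the code in this PR 
or in the stemmer; `normalize_romanian` and `CEDILLA_TO_COMMA` are hypothetical 
names:

```python
# Map cedilla letters to the comma-below letters (lower and upper case).
CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
})


def normalize_romanian(text):
    """Fold cedilla s/t to comma-below s/t; all other characters are untouched."""
    return text.translate(CEDILLA_TO_COMMA)


if __name__ == "__main__":
    # The test pairs from this comment: cedilla spelling, then comma spelling.
    pairs = [
        ("prelunge\u015fte", "prelunge\u0219te"),
        ("aprecia\u0163i", "aprecia\u021bi"),
    ]
    for cedilla_form, comma_form in pairs:
        print(cedilla_form, "->", normalize_romanian(cedilla_form),
              "matches comma form:", normalize_romanian(cedilla_form) == comma_form)
```

   After folding, each member of a pair is the same string, so a stemmer that 
recognizes only the comma forms would treat both spellings identically.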
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

