rmuir commented on PR #12172:
URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450458707

   I think we can merge this stopword list change as it is. But I think a 
normalization filter may be worthwhile as a separate PR?
   
   A filter has the advantage of conflating terms regardless of which 
variant of the character is used. After reading up on the history of 
these characters, I think we should always treat them as "the same" for 
Romanian. That allows queries and documents to match in all cases. So I don't 
think of it as a hack: the variant characters can also affect words beyond 
stopwords and the suffixes that get stemmed.
   
   I'd recommend doing it as a TokenFilter. Many Lucene analyzers for other 
languages have such normalizers as TokenFilters to deal with similar issues. 
A CharFilter is not needed in this case because the normalization won't affect 
tokenization: StandardTokenizer etc. will tokenize all the variants the same 
way (same Unicode word-break properties).
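
   As a rough illustration of the folding such a filter could apply, here is 
a minimal, Lucene-free sketch. It assumes the variants in question are the 
cedilla forms (U+015E/U+015F, U+0162/U+0163) and the comma-below forms 
(U+0218/U+0219, U+021A/U+021B) of s and t, and it assumes folding toward the 
comma-below forms; neither the class name nor the direction is taken from the 
PR. In a real TokenFilter the same loop would presumably run over the term 
buffer inside `incrementToken()`.

```java
// Hypothetical helper, not part of Lucene: folds the cedilla variants of
// s/t to the comma-below variants so both spellings yield the same term.
final class RomanianVariantFolding {
    private RomanianVariantFolding() {}

    /** Folds the first {@code length} chars of {@code buffer} in place; the length never changes. */
    static int normalize(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            switch (buffer[i]) {
                case '\u015E': buffer[i] = '\u0218'; break; // S with cedilla -> S with comma below
                case '\u015F': buffer[i] = '\u0219'; break; // s with cedilla -> s with comma below
                case '\u0162': buffer[i] = '\u021A'; break; // T with cedilla -> T with comma below
                case '\u0163': buffer[i] = '\u021B'; break; // t with cedilla -> t with comma below
                default: break;                             // every other char is left untouched
            }
        }
        return length;
    }

    /** Convenience wrapper for whole strings. */
    static String normalize(String term) {
        char[] buffer = term.toCharArray();
        normalize(buffer, buffer.length);
        return new String(buffer);
    }
}
```

   Because the mapping is one-to-one on single chars, offsets and term 
lengths stay the same, which is part of why a TokenFilter is enough here.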


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

