rmuir commented on PR #12172: URL: https://github.com/apache/lucene/pull/12172#issuecomment-1450458707

I think we can merge this stopword list change anyway, but a filter may be worthwhile as a separate PR? It has the advantage of conflating the terms regardless of which variant of the character is used. After reading up on the history of these characters, I think we should always treat them "the same" for Romanian: it will allow queries and documents to match in all cases. So I don't think of it as a hack, since the "different characters" may impact words beyond just stopwords and the suffixes that get stemmed.

I'd recommend doing it as a TokenFilter. Many Lucene analyzers for other languages have such normalizers as token filters to deal with similar issues. A CharFilter isn't needed in this case because it won't impact tokenization: StandardTokenizer etc. will tokenize all the variants the same way (same Unicode word-break properties).
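A minimal sketch of what such a normalizing TokenFilter could look like, assuming the goal is to fold the Romanian cedilla forms (ş U+015F, ţ U+0163 and their uppercase counterparts) into the comma-below forms (ș U+0219, ț U+021B); the class name and mappings here are illustrative, not code from this PR:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Illustrative filter folding Romanian cedilla characters into comma-below forms. */
public final class RomanianNormalizationFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  public RomanianNormalizationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Rewrite the term buffer in place; token boundaries are unchanged,
    // which is why a CharFilter is unnecessary here.
    final char[] buffer = termAtt.buffer();
    final int length = termAtt.length();
    for (int i = 0; i < length; i++) {
      switch (buffer[i]) {
        case '\u015F': buffer[i] = '\u0219'; break; // ş -> ș
        case '\u0163': buffer[i] = '\u021B'; break; // ţ -> ț
        case '\u015E': buffer[i] = '\u0218'; break; // Ş -> Ș
        case '\u0162': buffer[i] = '\u021A'; break; // Ţ -> Ț
        default: break;
      }
    }
    return true;
  }
}
```

In an analyzer it would presumably sit early in the chain (before the stop filter and stemmer), so that stopword matching and stemming only ever see the normalized forms.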
I think we can merge this stopword list change anyway. But I think a filter may be worthwhile as a separate PR? It has the advantage of making the terms conflate regardless of which variant of the character is being used. After reading up on the history of these characters, I think we should treat them "the same" for Romanian always. It will allow queries/documents to match in all cases. So I don't think of it as a hack, as the "different characters" may impact words outside of just stopwords and suffixes that are stemmed, too. I'd recommend doing it as a TokenFilter. Many lucene analyzers for other languages have such normalizers as tokenfilters to deal with similar issues. CharFilter is not needed in this case as it won't impact tokenization, since StandardTokenizer etc will tokenize them all the same way (same unicode wordbreak properties). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org