rmuir commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2906907266
If you re-read the description, I think you'll understand why I responded the way I did. To me it reads as if there isn't an understanding of the purpose of this filter, or of the reasons why text could have these problems. It comes across as "I'm linguist, I'm native speaker, these characters are different, this is wrong!!!" without any actual data/homework done, and it leaves all the "homework" to the maintainers.

This isn't meant as an attack on you, don't take it the wrong way, I'm just stating how I read it. There was never a shortage of native speakers here, instead a shortage of correct Unicode :)

If the issue was written differently (this is just an EXAMPLE), it would allow making progress and improvements without draining a lot of time:

EXAMPLE: Computerized text in this language has advanced in the last decade: e.g. content is now generally in (correct) Unicode, you don't have to download custom fonts from websites to read the text, nor are sites rendering text as images/PDF, nor are they doing janky conversion from 8-bit fonts. Client OSes can render it properly, e.g. Uniscribe renders complex scripts on Windows without checking special boxes or installing language packs, mobile phones work correctly, etc. I did a quick-n-dirty basic analysis with wget and regular expressions of a sample of common government/news sites, and confirmed that text generally has correct Unicode: we can tone it down. As a safe step, first remove too-aggressive rules (e.g. those that conflate different consonants); these cause more harm than good for "good text".

-- 
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org