rmuir commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2906907266

   If you re-read the description, I think you'll understand why I responded 
the way I did.
   To me it reads as though there isn't an understanding of the purpose of this 
filter, or of the reasons why text could have these problems. It comes across as 
"I'm a linguist, I'm a native speaker, these characters are different, this is 
wrong!!!" without any actual data/homework done, and it leaves all the 
"homework" to the maintainers. 
   
   This isn't meant as an attack on you, so don't take it the wrong way; I'm just 
stating how I read it. There was never a shortage of native speakers here, 
but rather a shortage of correct Unicode :)
   
   If the issue was written differently (this is just an EXAMPLE), it would 
allow making progress and improvements without draining a lot of time:
   
   EXAMPLE:
   
   Computerized text in this language has advanced in the last decade: e.g. 
content is now generally in (correct) Unicode, you don't have to download 
custom fonts from websites to read the text, and sites are no longer rendering 
text as images/PDF or doing janky conversion from 8-bit fonts. Client OSes can 
render it properly, e.g. Uniscribe renders complex scripts on Windows without 
checking special boxes or installing language packs, mobile phones work 
correctly, etc. I did a quick-n-dirty basic analysis with wget and regular 
expressions of a sample of common government/news sites, and confirmed that 
text generally has correct Unicode, so we can tone the filter down. As a safe 
first step, remove the too-aggressive rules (e.g. those that conflate different 
consonants); these cause more harm than good for "good text".
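
   For illustration only, the kind of quick-n-dirty check described in the 
EXAMPLE could look roughly like the Python sketch below. Everything in it is 
an assumption: `audit_text` and `SUSPECT_PATTERNS` are hypothetical names, and 
the patterns are generic signs of badly converted text (private-use code 
points, U+FFFD, text that is not in NFC), not the rules of any actual filter. 
A real analysis would replace them with the specific sequences in question.

```python
import re
import unicodedata
from collections import Counter

# Hypothetical patterns, for illustration only. A real check would list the
# code point sequences that the filter under discussion actually rewrites.
SUSPECT_PATTERNS = {
    # Private Use Area code points, which may show up with custom-font hacks.
    "private_use": re.compile(r"[\uE000-\uF8FF]"),
    # U+FFFD usually marks a failed encoding conversion.
    "replacement_char": re.compile("\uFFFD"),
}


def audit_text(text):
    """Count generic signs of problematic Unicode in already-decoded text."""
    counts = Counter()
    for name, pattern in SUSPECT_PATTERNS.items():
        counts[name] = len(pattern.findall(text))
    # 1 if the text changes under NFC normalization, else 0.
    counts["not_nfc"] = int(unicodedata.normalize("NFC", text) != text)
    counts["chars"] = len(text)
    return counts


if __name__ == "__main__":
    # Sample string standing in for page text fetched with wget.
    sample = "caf\u00e9 \ufffd \ue000"
    print(dict(audit_text(sample)))
```

   Running this over a sample of pages and comparing the counts against the 
total character count gives a rough number to put in the issue, which is the 
"homework" the EXAMPLE is asking for.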
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

