Trey314159 commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2902054358

   @praveen-d291: Thanks for the pull request! I was unsure how best to modify 
the tests since I don't read Telugu. I couldn't tell what would make 
natural-looking examples and I didn't want to further impose on the native 
speaker I have been working with to look at unit tests, so thank you for 
putting your overlapping knowledge of Telugu and Java to good use for others. I 
hope someone approves it!
   
   @rmuir: You have used "working as documented" as a thought-terminating 
cliché before. Just because something is accurately documented doesn't mean it 
is the right thing to do. 
   
   The referenced document is also no longer at the URL given in the code—and, 
based on the Wayback Machine, hasn't been for years—which will keep many people 
from finding and referencing it. However, I did find the new location of the 
current version (also thanks to the Wayback Machine):
   > http://languagelog.ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html
   
   A few thoughts:
   * Is text written in legacy fonts, extracted from PDFs, etc., still the most common use case for Telugu text indexed by Lucene these days? I get that the specific mapping improves recall for poorly curated text, but it does so at the cost of precision. Neither Praveen above nor the native speaker I've been working with seems to think this one mapping is useful. I originally questioned it because I, as a moderately attentive non-speaker, can see the difference between the characters in all but one of the Telugu-capable fonts I have, and in all of the Telugu-specific fonts I have (Arial Unicode being the one where it is visually ambiguous). That's very different from other mappings like బ + ు + ు (బుు) → ఋ, where there is no visual distinction between the two forms, and the non-canonical version makes no sense on its own ("buu" should be బ + ూ = బూ).
   * According to the referenced document's History section, it hasn't been 
updated since 1998. Technology moves fast, so it seems reasonable to review 
data sources and assumptions about content at least once every quarter century.
   * Also, if you read the referenced document _carefully,_ the relevant mapping seems to be included only for some expansive notion of completeness: it's parenthesized, unlike any other mapping, and **the associated comment in the referenced document explicitly says that MA [0C2E] "_will not_ be confused" with VU [0C35+0C41] because there is special rendering to make them distinct** (modulo Arial Unicode). You should be able to see the referenced difference in rendering in the title of this ticket.
   
   ```
   (U+0C2E     0C35 0C41           TELUGU LETTER MA will not be confused,
                                   as the script uses a special rendering
                                   of 0C41 in this case. The same is
                                   done in several other appearant cases.)
   ```
   
   I read that as indicating that including the VU/MA mapping is an error. There may be other such cases, as the comment suggests, but this one is high-frequency enough that it bubbled to the top in my analysis of our content, across several languages.
   
   "Don't use it if you don't like it" is another thought-terminating cliché 
you've used before. I have the wherewithal to do exactly that, but that's not 
why I'm here. I'm lucky to have the time and ability to do a detailed analysis 
of the effects of the components of various language analyzers on our content, 
which is often varied and voluminous, and I can usually find willing native 
speakers to help me untangle the more questionable or confusing bits. I can 
also write my own plugins, configure custom filters, and tweak anything and 
everything I need to. Not every organization using Lucene has the ability to do 
that, so I try to upstream generally applicable knowledge or improvements for 
the users who don't have the time, technical skill, and language knowledge 
needed to customize their own deployments.
   
   Even though I _can,_ forking and/or re-implementing the 99+% of 
`indic_normalization` that does good things is a brittle approach that cuts me 
off from future improvements and upgrades, and adds an unneeded maintenance 
burden to my deployment. I'd rather try to improve `indic_normalization` for 
everyone, or at least have a conversation about current vs historical trends in 
computing and content for the relevant language/script,[*] think about the 
trade-offs of recall and precision for the particular mapping, incorporate 
thoughts and insights from speakers of the language, and improve everyone's 
understanding of the current needs and wants of searchers and readers.
   
   > [*] I'd appreciate a link to the FIRE IR benchmarks you had in mind. A 
quick online search only revealed discussion of a Telugu Named Entity 
Recognition dataset.
   
   Instead, user received another abrasive and dismissive termination of 
discussion, which reminded user why user has not always shared[†] other 
generally applicable knowledge or ideas for improvements. 
   
   > [†] For example, user has [previously 
noted](https://www.mediawiki.org/wiki/User%3ATJones_%28WMF%29%2FNotes%2FUnpacking_Notes%2FBengali%23Double_the_Metaphone%2C_Double_the_Fun%28etics%29)
 that `bengali_normalization` uses a phonetic algorithm with much too 
aggressive compression for search, and verified this fact with native Bangla 
speakers. User recognizes that it works as documented, and since user did not 
like it, user does not use it—as user would expect to be advised. However, user 
felt bad for not trying to upstream this information to improve Bangla search 
for others. Now user feels less bad because the attempt would also likely have 
been rejected.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

