Trey314159 commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2902054358
@praveen-d291: Thanks for the pull request! I was unsure how best to modify the tests since I don't read Telugu. I couldn't tell what would make natural-looking examples, and I didn't want to further impose on the native speaker I have been working with by asking them to look at unit tests, so thank you for putting your overlapping knowledge of Telugu and Java to good use for others. I hope someone approves it!

@rmuir: You have used "working as documented" as a thought-terminating cliché before. Just because something is accurately documented doesn't mean it is the right thing to do. The referenced document is also no longer at the URL given in the code—and, based on the Wayback Machine, hasn't been for years—which will keep many people from finding and referencing it. However, I did find the new location of the current version (also thanks to the Wayback Machine):

> http://languagelog.ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html

A few thoughts:

* Is text written in legacy fonts, extracted from PDFs, etc. the most common use case for Telugu text indexed by Lucene these days? I get that this specific mapping improves recall for poorly curated text, but it does so at the cost of precision. Neither Praveen above nor the native speaker I've been working with seems to think this one mapping is useful. I originally questioned it because I—as a moderately attentive non-speaker—can see the difference between the characters in all but one of the Telugu-capable fonts I have, and in all of the Telugu-specific fonts I have—Arial Unicode being the one where it is visually ambiguous. That's very different from other mappings like బ + ు + ు (బుు) → ఋ, where there is no visual distinction between the non-canonical sequence and the result, and the non-canonical version makes no sense on its own ("buu" should be బ + ూ = బూ).
* According to the referenced document's History section, it hasn't been updated since 1998. Technology moves fast, so it seems reasonable to review data sources and assumptions about content at least once every quarter century.
* Also, if you read the referenced document _carefully,_ the relevant mapping seems to be included only for some expansive notion of completeness: it is parenthesized unlike any other mapping, and **the associated comment in the referenced document explicitly says that "MA [0C2E] _will not_ be confused" with VU [0C35+0C41] because there is special rendering to make them distinct** (modulo Arial Unicode). You should be able to see the referenced difference in rendering in the title of this ticket.

```
(U+0C2E 0C35 0C41 TELUGU LETTER MA will not be confused, as the script uses a special rendering of 0C41 in this case. The same is done in several other appearant cases.)
```

I read that as indicating that including the VU/MA mapping is an error. There may be other cases, as the comment suggests, but this one is high-frequency enough that it bubbled to the top in my analysis of our content, across several languages.

"Don't use it if you don't like it" is another thought-terminating cliché you've used before. I have the wherewithal to do exactly that, but that's not why I'm here. I'm lucky to have the time and ability to do a detailed analysis of the effects of the components of various language analyzers on our content, which is often varied and voluminous, and I can usually find willing native speakers to help me untangle the more questionable or confusing bits. I can also write my own plugins, configure custom filters, and tweak anything and everything I need to.
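For example, seeing exactly what a filter does to a given sequence only takes a few lines of Lucene code. Here is a minimal sketch; it assumes the mapping under discussion is applied by `IndicNormalizationFilter` (the class behind the `indic_normalization` name), and the two input tokens (VU and MA) are purely illustrative:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TeluguNormalizationCheck {
  public static void main(String[] args) throws Exception {
    // Two illustrative tokens: VU (U+0C35 U+0C41) and MA (U+0C2E).
    String input = "\u0C35\u0C41 \u0C2E";

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(input));
    TokenStream stream = new IndicNormalizationFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // Print each normalized token as code points, so the result is
      // unambiguous even in a font that renders the two forms identically.
      term.toString().codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
      System.out.println();
    }
    stream.end();
    stream.close();
  }
}
```

Printing code points rather than rendered characters sidesteps the font question entirely, so you can see whether the two tokens end up as the same indexed term.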
Not every organization using Lucene has the ability to do that, so I try to upstream generally applicable knowledge or improvements for the users who don't have the time, technical skill, and language knowledge needed to customize their own deployments. Even though I _can,_ forking and/or re-implementing the 99+% of `indic_normalization` that does good things is a brittle approach that cuts me off from future improvements and upgrades and adds an unneeded maintenance burden to my deployment. I'd rather try to improve `indic_normalization` for everyone, or at least have a conversation about current vs. historical trends in computing and content for the relevant language/script,[*] think about the trade-offs between recall and precision for the particular mapping, incorporate thoughts and insights from speakers of the language, and improve everyone's understanding of the current needs and wants of searchers and readers.

> [*] I'd appreciate a link to the FIRE IR benchmarks you had in mind. A quick online search only turned up discussion of a Telugu Named Entity Recognition dataset.

Instead, I got another abrasive and dismissive termination of the discussion, which reminded me why I haven't always shared[†] other generally applicable knowledge or ideas for improvements.

> [†] For example, I have [previously noted](https://www.mediawiki.org/wiki/User%3ATJones_%28WMF%29%2FNotes%2FUnpacking_Notes%2FBengali%23Double_the_Metaphone%2C_Double_the_Fun%28etics%29) that `bengali_normalization` uses a phonetic algorithm with much too aggressive compression for search, and I verified this with native Bangla speakers. I recognize that it works as documented, and since I didn't like it, I don't use it—as I would expect to be advised. However, I felt bad for not trying to upstream this information to improve Bangla search for others. Now I feel less bad, because the attempt would also likely have been rejected.
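For the record, "not using it" in practice means re-assembling the rest of the chain by hand. A minimal sketch, with an assumed (not verbatim) component list and a hypothetical class name:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.bn.BengaliStemFilter;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Hypothetical Bengali analysis chain that skips the Bengali-specific
 * normalization. Lowercasing, digit folding, and stop words are omitted
 * for brevity; the point is only that dropping one filter means
 * re-assembling (and then maintaining) the rest of the chain yourself.
 */
public final class BengaliWithoutPhoneticNormalizationAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Keep the generic Indic normalization, drop BengaliNormalizationFilter,
    // then stem as usual.
    TokenStream result = new IndicNormalizationFilter(source);
    result = new BengaliStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

Which is exactly the kind of fork-and-maintain burden described above for `indic_normalization`.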