Trey314159 commented on issue #14659:
URL: https://github.com/apache/lucene/issues/14659#issuecomment-2902054358
@praveen-d291: Thanks for the pull request! I was unsure how best to modify the tests since I don't read Telugu. I couldn't tell what would make natural-looking examples, and I didn't want to further impose on the native speaker I have been working with by asking them to look at unit tests, so thank you for putting your overlapping knowledge of Telugu and Java to good use for others. I hope someone approves it!

@rmuir: You have used "working as documented" as a thought-terminating cliché before. Just because something is accurately documented doesn't mean it is the right thing to do. The referenced document is also no longer at the URL given in the code—and, based on the Wayback Machine, hasn't been for years—which will keep many people from finding and referencing it. However, I did find the new location of the current version (also thanks to the Wayback Machine):

> http://languagelog.ldc.upenn.edu/myl/ldc/IndianScriptsUnicode.html

A few thoughts:

* Is text written in legacy fonts, extracted from PDFs, etc. the most common use case for Telugu text indexed by Lucene these days? I get that this specific mapping improves recall for poorly curated text, but it does so at the cost of precision. Neither Praveen above nor the native speaker I've been working with seems to think this one mapping is useful. I originally questioned it because I—as a moderately attentive non-speaker—can see the difference between the characters in all but one of the Telugu-capable fonts I have, and in all of the Telugu-specific fonts I have—Arial Unicode being the one where it is visually ambiguous. That's very different from other mappings like బ + ు + ు (బుు) → ఋ, where there is no visual distinction between the non-canonical sequence and the result, and the non-canonical version makes no sense on its own ("buu" should be బ + ూ = బూ).
* According to the referenced document's History section, it hasn't been updated since 1998. Technology moves fast, so it seems reasonable to review data sources and assumptions about content at least once every quarter century.
* Also, if you read the referenced document _carefully,_ the relevant mapping seems to be included only for some expansive notion of completeness: it is parenthesized unlike any other mapping, and **the associated comment in the referenced document explicitly says that "MA [0C2E] _will not_ be confused" with VU [0C35+0C41] because there is special rendering to make them distinct** (modulo Arial Unicode). You should be able to see the referenced difference in rendering in the title of this ticket.

```
(U+0C2E 0C35 0C41 TELUGU LETTER MA will not be confused, as the script uses a special rendering of 0C41 in this case. The same is done in several other appearant cases.)
```

I read that as indicating that including the VU/MA mapping is an error. There may be other cases, as the comment suggests, but this one is high-frequency enough that it bubbled to the top in my analysis of our content, across several languages.

"Don't use it if you don't like it" is another thought-terminating cliché you've used before. I have the wherewithal to do exactly that, but that's not why I'm here. I'm lucky to have the time and ability to do a detailed analysis of the effects of the components of various language analyzers on our content, which is often varied and voluminous, and I can usually find willing native speakers to help me untangle the more questionable or confusing bits. I can also write my own plugins, configure custom filters, and tweak anything and everything I need to.
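For example, seeing exactly what a filter does to a given sequence only takes a few lines of Lucene code. Here is a minimal sketch; it assumes the mapping under discussion is applied by `IndicNormalizationFilter` (the class behind the `indic_normalization` name), and the two input tokens (VU and MA) are purely illustrative:

```java
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TeluguNormalizationCheck {
  public static void main(String[] args) throws Exception {
    // Two illustrative tokens: VU (U+0C35 U+0C41) and MA (U+0C2E).
    String input = "\u0C35\u0C41 \u0C2E";

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(input));
    TokenStream stream = new IndicNormalizationFilter(tokenizer);
    CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

    stream.reset();
    while (stream.incrementToken()) {
      // Print each normalized token as code points, so the result is
      // unambiguous even in a font that renders the two forms identically.
      term.toString().codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
      System.out.println();
    }
    stream.end();
    stream.close();
  }
}
```

Printing code points rather than rendered characters sidesteps the font question entirely, so you can see whether the two tokens end up as the same indexed term.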
Not every organization using Lucene has the ability to do that, so I try to upstream generally applicable knowledge or improvements for the users who don't have the time, technical skill, and language knowledge needed to customize their own deployments. Even though I _can,_ forking and/or re-implementing the 99+% of `indic_normalization` that does good things is a brittle approach that cuts me off from future improvements and upgrades and adds an unneeded maintenance burden to my deployment. I'd rather try to improve `indic_normalization` for everyone, or at least have a conversation about current vs. historical trends in computing and content for the relevant language/script,[*] think about the trade-offs between recall and precision for the particular mapping, incorporate thoughts and insights from speakers of the language, and improve everyone's understanding of the current needs and wants of searchers and readers.

> [*] I'd appreciate a link to the FIRE IR benchmarks you had in mind. A quick online search only turned up discussion of a Telugu Named Entity Recognition dataset.

Instead, I got another abrasive and dismissive termination of the discussion, which reminded me why I haven't always shared[†] other generally applicable knowledge or ideas for improvements.

> [†] For example, I have [previously noted](https://www.mediawiki.org/wiki/User%3ATJones_%28WMF%29%2FNotes%2FUnpacking_Notes%2FBengali%23Double_the_Metaphone%2C_Double_the_Fun%28etics%29) that `bengali_normalization` uses a phonetic algorithm with much too aggressive compression for search, and I verified this with native Bangla speakers. I recognize that it works as documented, and since I didn't like it, I don't use it—as I would expect to be advised. However, I felt bad for not trying to upstream this information to improve Bangla search for others. Now I feel less bad, because the attempt would also likely have been rejected.
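For the record, "not using it" in practice means re-assembling the rest of the chain by hand. A minimal sketch, with an assumed (not verbatim) component list and a hypothetical class name:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.bn.BengaliStemFilter;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/**
 * Hypothetical Bengali analysis chain that skips the Bengali-specific
 * normalization. Lowercasing, digit folding, and stop words are omitted
 * for brevity; the point is only that dropping one filter means
 * re-assembling (and then maintaining) the rest of the chain yourself.
 */
public final class BengaliWithoutPhoneticNormalizationAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    Tokenizer source = new StandardTokenizer();
    // Keep the generic Indic normalization, drop BengaliNormalizationFilter,
    // then stem as usual.
    TokenStream result = new IndicNormalizationFilter(source);
    result = new BengaliStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

Which is exactly the kind of fork-and-maintain burden described above for `indic_normalization`.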