praveen-d291 commented on issue #14659: URL: https://github.com/apache/lucene/issues/14659#issuecomment-2907706053
@rmuir, you're absolutely right; I should have led with this data in my initial comment. My apologies for not providing the "homework" upfront. Here's a direct look at the state of modern Telugu content, which strongly suggests that the issues the IndicNormalizationFilter was designed to address are far less prevalent now.

1. **Prevalence of clean Unicode text**: I've analyzed several high-volume, real-world Telugu sources, and the trend towards clean Unicode is very clear across these examples:
   - The official website of the Government of Telangana: https://www.telangana.gov.in/te/
   - The Andhra Pradesh Government's Irrigation Department website: https://irrigationap.cgg.gov.in/wrd/home
   - The Andhra Pradesh Agriculture Department website: https://www.apagrisnet.gov.in/
   - A major Telugu news publication, Eenadu: https://www.eenadu.net/ (consistently a top-3 paper by circulation)

   All content on these sites consistently uses UTF-8 Unicode. Characters like వు (vu) and మ (ma) are rendered distinctly and unambiguously.

2. **Widespread OS-level font support**: The need for "custom fonts from websites" or "janky conversion" is largely gone because popular OS vendors have been bundling robust Telugu font support for over two decades:
   - **Windows**: Gautami has been included since 2001 (https://en.wikipedia.org/wiki/Gautami_(typeface)). Nirmala UI, a comprehensive typeface for Indic scripts, has been bundled since Windows 8 (https://en.wikipedia.org/wiki/Nirmala_UI).
   - **macOS**: macOS Monterey alone ships 15 Telugu fonts (Apple support page: https://support.apple.com/en-in/103203).

   This widespread, native OS support means users today generally aren't dealing with systems that require special handling or that struggle to render modern Unicode Telugu text.

The core issue is that applying the వు to మ conflation by default now introduces a linguistically incorrect loss of precision for the vast majority of current Telugu content. Given this, I want to reiterate the two options I proposed earlier:

**Option 1: Fix the default (my preference)**
Add a boolean option to the TeluguAnalyzer constructor that controls whether IndicNormalizationFilter is included, and default it to false. This would make TeluguAnalyzer precise right out of the box for modern documents, while users with older, less cleanly encoded text could still explicitly enable the filter. I believe this is a necessary correction for linguistic accuracy, and it makes the conversion an explicit, documented choice.

**Option 2: Document the behavior in TeluguAnalyzer**
Alternatively, we could document this behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and showing how to build a custom analyzer that avoids it (a sketch follows below).

Option 1 feels like the right long-term fix for the default user experience, given the current state of Telugu content. What do you think? I can raise a PR once we agree on an approach.
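To make Option 2 concrete, here's a minimal sketch of what that documented workaround could look like: a custom `Analyzer` that keeps the Telugu-specific normalization and stemming but omits IndicNormalizationFilter. The class name is mine, and the exact filter order (and which packages LowerCaseFilter/StopFilter live in) may differ slightly across Lucene versions, so please read it as illustrative rather than a verified drop-in replacement:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.DecimalDigitFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.te.TeluguAnalyzer;
import org.apache.lucene.analysis.te.TeluguNormalizationFilter;
import org.apache.lucene.analysis.te.TeluguStemFilter;

/**
 * Telugu analyzer that keeps Telugu-specific normalization and stemming
 * but skips IndicNormalizationFilter, so వు and మ remain distinct terms.
 */
public class PreciseTeluguAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    result = new DecimalDigitFilter(result);
    result = new StopFilter(result, TeluguAnalyzer.getDefaultStopSet());
    // IndicNormalizationFilter is intentionally omitted here.
    result = new TeluguNormalizationFilter(result);
    result = new TeluguStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

If we go with Option 1 instead, the proposed boolean flag would essentially just gate the single IndicNormalizationFilter step inside `TeluguAnalyzer#createComponents`, so the code change itself should be small; the main work is documentation and tests.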