praveen-d291 commented on issue #14659: URL: https://github.com/apache/lucene/issues/14659#issuecomment-2907706053
@rmuir, you're absolutely right; I should have led with this data in my initial comment. My apologies for not providing the "homework" upfront. Here's a direct look at the state of modern Telugu content, which strongly suggests that the issues the IndicNormalizationFilter was designed to address are far less prevalent now.

1. **Prevalence of clean Unicode text**: I've analyzed several high-volume, real-world Telugu sources, and the trend towards clean Unicode is very clear across these examples:
   - The official website of the Government of Telangana: https://www.telangana.gov.in/te/
   - The Andhra Pradesh Government's Irrigation Department website: https://irrigationap.cgg.gov.in/wrd/home
   - The Andhra Pradesh Agriculture Department website: https://www.apagrisnet.gov.in/
   - A major Telugu news publication, Eenadu: https://www.eenadu.net/ (consistently a top-3 paper by circulation)

   All content on these sites consistently uses UTF-8 Unicode. Characters like వు (vu) and మ (ma) are rendered distinctly and unambiguously.

2. **Widespread OS-level font support**: The need for "custom fonts from websites" or "janky conversion" is largely gone because popular OS vendors have been bundling robust Telugu font support for over two decades:
   - **Windows**: Gautami has been included since 2001 (https://en.wikipedia.org/wiki/Gautami_(typeface)). Nirmala UI, a comprehensive typeface for Indic scripts, has been bundled since Windows 8 (https://en.wikipedia.org/wiki/Nirmala_UI).
   - **macOS**: macOS Monterey alone ships 15 Telugu fonts (Apple support page: https://support.apple.com/en-in/103203).

   This widespread, native OS support means users today generally aren't dealing with systems that require special handling or that struggle to render modern Unicode Telugu text.

The core issue is that applying the వు to మ conflation by default now introduces a linguistically incorrect loss of precision for the vast majority of current Telugu content. Given this, I want to reiterate the two options I proposed earlier:

**Option 1: Fix the default (my preference)**
Add a boolean option to the TeluguAnalyzer constructor that controls whether IndicNormalizationFilter is included, and default it to false. This would make TeluguAnalyzer precise right out of the box for modern documents, while users with older, less cleanly encoded text could still explicitly enable the filter. I believe this is a necessary correction for linguistic accuracy, and it makes the conversion an explicit, documented choice.

**Option 2: Document the behavior in TeluguAnalyzer**
Alternatively, we could document this behavior in the TeluguAnalyzer docs, explaining the వు to మ mapping and showing how to build a custom analyzer that avoids it (a sketch follows below).

Option 1 feels like the right long-term fix for the default user experience, given the current state of Telugu content. What do you think? I can raise a PR once we agree on an approach.
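To make Option 2 concrete, here's a minimal sketch of what that documented workaround could look like: a custom `Analyzer` that keeps the Telugu-specific normalization and stemming but omits IndicNormalizationFilter. The class name is mine, and the exact filter order (and which packages LowerCaseFilter/StopFilter live in) may differ slightly across Lucene versions, so please read it as illustrative rather than a verified drop-in replacement:

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.DecimalDigitFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.te.TeluguAnalyzer;
import org.apache.lucene.analysis.te.TeluguNormalizationFilter;
import org.apache.lucene.analysis.te.TeluguStemFilter;

/**
 * Telugu analyzer that keeps Telugu-specific normalization and stemming
 * but skips IndicNormalizationFilter, so వు and మ remain distinct terms.
 */
public class PreciseTeluguAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    StandardTokenizer source = new StandardTokenizer();
    TokenStream result = new LowerCaseFilter(source);
    result = new DecimalDigitFilter(result);
    result = new StopFilter(result, TeluguAnalyzer.getDefaultStopSet());
    // IndicNormalizationFilter is intentionally omitted here.
    result = new TeluguNormalizationFilter(result);
    result = new TeluguStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

If we go with Option 1 instead, the proposed boolean flag would essentially just gate the single IndicNormalizationFilter step inside `TeluguAnalyzer#createComponents`, so the code change itself should be small; the main work is documentation and tests.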