Re: [PR] [Draft] Fix for the bug where JapaneseReadingFormFilter cannot convert some hiragana to romaji [lucene]

via GitHub Thu, 07 Dec 2023 09:54:00 -0800


msfroh commented on PR #12885:
URL: https://github.com/apache/lucene/pull/12885#issuecomment-1845843504


   This is really interesting. It looks like the filter logic is already trying 
to convert to katakana before converting to romaji.
   
   Specifically in 
https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java#L49-L63,
 my take is:
   
   1. The `reading` variable gets populated with (what I understand should be) 
the katakana reading (which gets returned if `useRomaji` is `false`).
   2. Assuming that reading is populated and `useRomaji` is `true`, then we 
convert the katakana to romaji.
   
   So, I'm wondering if maybe there's a bug in the `JaMorphData.getReading()` 
implementation? It looks like there's already supposed to be a hiragana -> 
katakana shift here: 
https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoMorphData.java#L115
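
To make the hiragana -> katakana shift concrete, here is a minimal standalone sketch of the usual technique. It is not the Lucene code linked above; the class and method names (`KanaShiftSketch`, `toKatakana`) are made up for illustration. It relies only on the Unicode layout, where hiragana U+3041..U+3096 sit exactly 0x60 below the corresponding katakana U+30A1..U+30F6.

```java
// Hypothetical helper for illustration only; not part of Lucene.
final class KanaShiftSketch {
    // Hiragana range that has a one-to-one katakana counterpart.
    private static final char HIRAGANA_START = 0x3041;
    private static final char HIRAGANA_END = 0x3096;
    // Fixed distance between a hiragana code point and its katakana form.
    private static final int KANA_OFFSET = 0x60;

    /** Returns a copy of text with every hiragana character shifted to katakana. */
    static String toKatakana(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= HIRAGANA_START && c <= HIRAGANA_END) {
                sb.append((char) (c + KANA_OFFSET));
            } else {
                // Katakana, Latin letters, kanji, etc. pass through unchanged.
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The escapes spell "hiragana" in hiragana: hi, ra, ga, na.
        System.out.println(toKatakana("\u3072\u3089\u304C\u306A"));
    }
}
```

If a reading reaches the romaji step while still in hiragana, a shift like this evidently was not applied to it somewhere upstream, which is the kind of gap worth checking in `getReading()`.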
   
   (This is my first time reading this code, backed by my very limited, 
Duolingo-based knowledge of Japanese, so I might be wrong.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


