msfroh commented on PR #12885: URL: https://github.com/apache/lucene/pull/12885#issuecomment-1845843504
This is really interesting. It looks like the filter logic already tries to convert to katakana before converting to romaji. Specifically, in https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/JapaneseReadingFormFilter.java#L49-L63, my take is:

1. The `reading` variable gets populated with (what I understand should be) the katakana reading, which is returned as-is if `useRomaji` is `false`.
2. Assuming that reading is populated and `useRomaji` is `true`, we then convert the katakana to romaji.

So, I'm wondering if maybe there's a bug in the `JaMorphData.getReading()` implementation? It looks like there's already supposed to be a hiragana -> katakana shift here: https://github.com/apache/lucene/blob/main/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoMorphData.java#L115

(This is my first time reading this code, backed by my very limited, Duolingo-based knowledge of Japanese, so I might be wrong.)
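For reference, here is a minimal, self-contained sketch of what a hiragana -> katakana shift generally looks like. It is not copied from `TokenInfoMorphData` or any other Lucene class; the class name `KanaShiftSketch` and the method `toKatakana` are hypothetical. It relies only on the Unicode layout, where hiragana U+3041..U+3096 sit exactly 0x60 below their katakana counterparts U+30A1..U+30F6.

```java
// Hypothetical illustration only; not the Lucene implementation.
final class KanaShiftSketch {
    // Hiragana letters U+3041..U+3096 map onto katakana U+30A1..U+30F6
    // at a fixed code point offset of 0x60.
    private static final char HIRAGANA_START = '\u3041';
    private static final char HIRAGANA_END = '\u3096';
    private static final int KANA_OFFSET = 0x60;

    /** Returns a copy of text with every hiragana letter shifted to katakana. */
    static String toKatakana(CharSequence text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c >= HIRAGANA_START && c <= HIRAGANA_END) {
                // Shift into the katakana block.
                sb.append((char) (c + KANA_OFFSET));
            } else {
                // Katakana, kanji, Latin, etc. pass through unchanged.
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

With this sketch, `KanaShiftSketch.toKatakana("すし")` yields `スシ`, while input that is already katakana comes back unchanged. If the reading coming out of the dictionary were still hiragana at the point where the romaji conversion runs, a missing or skipped shift like this would be one place to look.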