You could certainly do it that way if you wanted. The one point I would make here is that from a linguistic POV these are not synonyms but are the same term written in a different alphabet.
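For what it's worth, here is a minimal sketch of that filter idea in Python. It is not Solr code: the names `to_katakana` and `with_kana_variants` are made up for illustration, and it assumes the terms have already been tokenized. It relies only on the fact that the Unicode Hiragana block (U+3041..U+3096) and the matching Katakana block (U+30A1..U+30F6) are laid out in parallel, 0x60 apart. Kanji is left alone.

```python
# Hiragana U+3041..U+3096 maps onto Katakana U+30A1..U+30F6 by adding 0x60.
HIRAGANA_TO_KATAKANA = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def to_katakana(text):
    """Fold Hiragana to Katakana; Kanji, Latin and Katakana pass through unchanged."""
    return text.translate(HIRAGANA_TO_KATAKANA)


def with_kana_variants(tokens):
    """Yield (position, term) pairs for an already-tokenized list of terms.

    When a term contains Hiragana, its Katakana form is emitted as an extra
    token at the same position, so either spelling matches at query time.
    """
    for position, term in enumerate(tokens):
        yield position, term
        folded = to_katakana(term)
        if folded != term:
            yield position, folded


print(to_katakana("とよた"))  # トヨタ
for position, term in with_kana_variants(["とよた", "車"]):
    print(position, term)
```

Calling `to_katakana` on every term at both index and query time gives the "conflate to one set of kana" behaviour; `with_kana_variants` gives the "extra token at the same position" behaviour. Which one is preferable depends on whether you want the original spelling kept in the index.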
François

On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote:

> Sounds more like generating synonyms than conflating everything to one set of
> kana.
>
> Why not a filter that does that transliteration and adds a token at the same
> position?
>
> wunder
>
> On Mar 11, 2011, at 9:49 AM, Tomás Fernández Löbbe wrote:
>
>> "the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ'
>> or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration
>> will miss results."
>> Exactly, that's my problem: searching in a different alphabet than the one
>> in which the document was indexed.
>> François, thank you for your help. Have you used the new ICU Filters? Do
>> they work OK? (I know they don't do Kanji.)
>>
>> Tomás
>>
>> 2011/3/11 François Schiettecatte <fschietteca...@gmail.com>
>>
>>> Good question about transliteration, the issue has to do with recall, for
>>> example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
>>> respectively), not doing the transliteration will miss results. You will
>>> find that the big search engines do the transliteration for you
>>> automatically. This issue gets even more complicated when you dig into
>>> orthographic variation, because Japanese orthography is very variable (i.e.
>>> there is more than one way to write a 'word'), as is tokenization (i.e.
>>> there is more than one way to tokenize it), see:
>>>
>>> http://www.cjk.org/cjk/reference/japvar.htm
>>>
>>> I have used the Basis Technology software in the past, it is very good, but
>>> it is also very expensive.
>>>
>>> François
>>>
>>> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>>>
>>>> Why not index it as-is? Solr can handle Unicode.
>>>>
>>>> Transliterating hiragana to katakana is a very weird idea. I cannot
>>>> imagine how that would help.
>>>>
>>>> You will need some sort of tokenization to find word boundaries. N-grams
>>>> work OK for search, but are really ugly for highlighting.
>>>>
>>>> As far as I know, there are no good-quality free tokenizers for Japanese.
>>>> Basis Technology sells Japanese support that works with Lucene and Solr.
>>>>
>>>> wunder
>>>>
>>>> On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
>>>>
>>>>> Tomás
>>>>>
>>>>> That won't really work. Transliteration to Romaji works for individual
>>>>> terms only, so you would need to tokenize the Japanese prior to
>>>>> transliteration. I am not sure what tool you plan to use for
>>>>> transliteration. I have used ICU in the past and from what I can tell it
>>>>> does not transliterate Kanji. Besides, transliterating Kanji is debatable
>>>>> for a variety of reasons.
>>>>>
>>>>> What I would suggest is that you transliterate Hiragana to Katakana,
>>>>> leave the Kanji alone, and index/search using ngrams. If you want 'proper'
>>>>> tokenization I would recommend Mecab.
>>>>>
>>>>> I have looked into this for a client and there is no clear-cut solution.
>>>>>
>>>>> Cheers
>>>>>
>>>>> François
>>>>>
>>>>>
>>>>> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>>>>>
>>>>>> This question is probably not a completely Solr question but it's
>>>>>> related to it. I'm dealing with a Japanese Solr application in which I
>>>>>> would like to be able to search in any of the Japanese alphabets. The
>>>>>> content can also be in any Japanese alphabet. I've been thinking of this
>>>>>> solution: convert everything to roma-ji, at index time and query time.
>>>>>> For example:
>>>>>>
>>>>>> Indexing time:
>>>>>> [Something in Hiragana] --> translate it to roma-ji --> index
>>>>>>
>>>>>> Searching time:
>>>>>> [Something in Katakana] --> translate it to roma-ji --> search
>>>>>> or
>>>>>> [Something in Kanji] --> translate it to roma-ji --> search
>>>>>>
>>>>>> I don't have a deep understanding of Japanese, and that's my problem.
>>>>>> Has anybody on the list tried something like this before? Did it work?
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Tomás
>>>>>
>>>>
>>>> --
>>>> Walter Underwood
>>>> Venture ASM, Troop 14, Palo Alto
>>>>
>>>
>
> --
> Walter Underwood
> Venture ASM, Troop 14, Palo Alto
>