Re: Multiple Japanese Alphabets in Solr

François Schiettecatte Fri, 11 Mar 2011 11:26:12 -0800

You could certainly do it that way if you wanted. 

The one point I would make here is that from a linguistic POV these are not 
synonyms but are the same term written in a different alphabet.
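
For illustration only, here is a minimal sketch of the filter idea in plain Python. It is not a Solr TokenFilter, and the function names are hypothetical. It assumes the input is already tokenized, folds Hiragana to Katakana by code point offset, and emits the folded form with a position increment of 0, mimicking how stacked tokens at the same position are usually represented in Lucene.

```python
# Offset between the Hiragana and Katakana Unicode blocks (U+3041 -> U+30A1).
KANA_OFFSET = 0x60


def hiragana_to_katakana(text):
    """Map Hiragana letters U+3041..U+3096 to their Katakana counterparts.

    Kanji, Katakana, Latin and everything else pass through unchanged.
    """
    return "".join(
        chr(ord(ch) + KANA_OFFSET) if "\u3041" <= ch <= "\u3096" else ch
        for ch in text
    )


def expand_tokens(tokens):
    """Yield (term, position_increment) pairs for pre-tokenized input.

    The original term keeps increment 1. If its Katakana form differs,
    that form is added with increment 0, i.e. at the same position.
    """
    for term in tokens:
        yield term, 1
        folded = hiragana_to_katakana(term)
        if folded != term:
            yield folded, 0


if __name__ == "__main__":
    for term, increment in expand_tokens(["とよた", "車"]):
        print(term, increment)
```

Conflating everything to one set of kana instead would mean indexing only the output of hiragana_to_katakana and applying the same function to the query.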


François

On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote:

> Sounds more like generating synonyms than conflating everything to one set of 
> kana.
> 
> Why not a filter that does that transliteration and adds a token at the same
> position?
> 
> wunder
> 
> On Mar 11, 2011, at 9:49 AM, Tomás Fernández Löbbe wrote:
> 
>> "the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ'
>> or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration
>> will miss results."
>> Exactly, that's my problem: searching in a different alphabet than the one
>> in which a document was indexed.
>> François, thank you for your help. Have you used the new ICU Filters? Do
>> they work OK? (I know it doesn't do Kanji)
>> 
>> Tomás
>> 
>> 2011/3/11 François Schiettecatte <fschietteca...@gmail.com>
>> 
>>> Good question about transliteration, the issue has to do with recall, for
>>> example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
>>> respectively), not doing the transliteration will miss results. You will
>>> find that the big search engines do the transliteration for you
>>> automatically. This issue gets even more complicated when you dig into
>>> orthographic variation because Japanese orthography is very variable (i.e.
>>> there is more than one way to write a 'word'), as is tokenization (i.e. there
>>> is more than one way to tokenize it), see:
>>> 
>>>      http://www.cjk.org/cjk/reference/japvar.htm
>>> 
>>> I have used the Basis Technology software in the past, it is very good, but
>>> it is also very expensive.
>>> 
>>> François
>>> 
>>> On Mar 11, 2011, at 11:53 AM, Walter Underwood wrote:
>>> 
>>>> Why not index it as-is? Solr can handle Unicode.
>>>> 
>>>> Transliterating hiragana to katakana is a very weird idea. I cannot
>>>> imagine how that would help.
>>>> 
>>>> You will need some sort of tokenization to find word boundaries. N-grams
>>>> work OK for search, but are really ugly for highlighting.
>>>> 
>>>> As far as I know, there are no good-quality free tokenizers for Japanese.
>>>> Basis Technology sells Japanese support that works with Lucene and Solr.
>>>> 
>>>> wunder
>>>> 
>>>> On Mar 11, 2011, at 8:09 AM, François Schiettecatte wrote:
>>>> 
>>>>> Tomás
>>>>> 
>>>>> That won't really work: transliteration to Romaji works for individual
>>>>> terms only, so you would need to tokenize the Japanese prior to
>>>>> transliteration. I am not sure what tool you plan to use for
>>>>> transliteration; I have used ICU in the past and from what I can tell it
>>>>> does not transliterate Kanji. Besides, transliterating Kanji is debatable
>>>>> for a variety of reasons.
>>>>> 
>>>>> What I would suggest is that you transliterate Hiragana to Katakana,
>>>>> leave the Kanji alone, and index/search using ngrams. If you want 'proper'
>>>>> tokenization I would recommend Mecab.
>>>>> 
>>>>> I have looked into this for a client and there is no clear cut solution.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> François
>>>>> 
>>>>> 
>>>>> On Mar 11, 2011, at 10:29 AM, Tomás Fernández Löbbe wrote:
>>>>> 
>>>>>> This question is probably not a completely Solr question but it's related to
>>>>>> it. I'm dealing with a Japanese Solr application in which I would like to be
>>>>>> able to search in any of the Japanese Alphabets. The content can also be in
>>>>>> any Japanese Alphabet. I've been thinking of this solution: Convert
>>>>>> everything to roma-ji, at index time and query time.
>>>>>> For example:
>>>>>> 
>>>>>> Indexing time:
>>>>>> [Something in Hiragana] --> translate it to roma-ji --> index
>>>>>> 
>>>>>> Searching time:
>>>>>> [Something in Katakana] --> translate it to roma-ji --> search
>>>>>> or
>>>>>> [Something in Kanji] --> translate it to roma-ji --> search
>>>>>> 
>>>>>> I don't have a deep understanding of Japanese, and that's my problem. Has
>>>>>> anybody on the list tried something like this before? Did it work?
>>>>>> 
>>>>>> 
>>>>>> Thanks,
>>>>>> 
>>>>>> Tomás
>>>>> 
>>>> 
>>>> --
>>>> Walter Underwood
>>>> Venture ASM, Troop 14, Palo Alto
>>>> 
>>>> 
>>>> 
>>> 
>>> 
> 
> --
> Walter Underwood
> Venture ASM, Troop 14, Palo Alto
> 
> 
>
