Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte

You could certainly do it that way if you wanted. The one point I would make here is that from a linguistic POV these are not synonyms but are the same term written in a different alphabet. François On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote: > Sounds more like generating synonyms t

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte

Tomás The ICU code base is used by a *lot* so I think it is safe to say that it works ok :) François On Mar 11, 2011, at 12:49 PM, Tomás Fernández Löbbe wrote: > "the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' > or 'とよた' (Katakana and Hiragana respectively), not do

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Walter Underwood

Sounds more like generating synonyms than conflating everything to one set of kana. Why not a filter that does that transliteration and adds a token at the same position? wunder On Mar 11, 2011, at 9:49 AM, Tomás Fernández Löbbe wrote: > "the issue has to do with recall, for example, I can wr
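A minimal sketch of the "add a token at the same position" idea, in plain Python rather than the actual Solr/Lucene filter API. The names `Token`, `to_katakana` and `add_kana_variants` are hypothetical, and the (term, position increment) pair is only a stand-in for how an analysis chain tracks positions: an increment of 0 means "same position as the previous token".

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical token shape: (term, position increment).
Token = Tuple[str, int]

def to_katakana(term: str) -> str:
    """Map hiragana (U+3041..U+3096) to katakana by adding 0x60."""
    return "".join(
        chr(ord(ch) + 0x60) if 0x3041 <= ord(ch) <= 0x3096 else ch
        for ch in term
    )

def add_kana_variants(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token, plus its katakana form stacked at the same position."""
    for term, pos_inc in tokens:
        yield term, pos_inc
        variant = to_katakana(term)
        if variant != term:
            # Increment 0 stacks the variant on the original token,
            # the way a synonym would be stacked.
            yield variant, 0

stream = list(add_kana_variants([("とよた", 1), ("車", 1)]))
```

Because the original term is kept and the variant is only added, the text is still indexed as written, and highlighting can use the original token's offsets.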

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Tomás Fernández Löbbe

"the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration will miss results." Exactly, that's my problem: searching in a different alphabet than the one in which a document was indexed. François, than

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte

Good question about transliteration, the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration will miss results. You will find that the big search engines do the transliteration for you automatically.
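To make the recall point concrete, here is a small sketch, assuming the goal is only to fold hiragana onto katakana so that both spellings of a term compare equal. It relies on the Unicode layout, where the hiragana letters U+3041..U+3096 sit exactly 0x60 below their katakana counterparts. The function name `fold_kana` is an assumption for illustration; this is not ICU or Solr code, and it does not handle kanji or Romaji.

```python
HIRAGANA_FIRST = 0x3041  # ぁ
HIRAGANA_LAST = 0x3096   # ゖ
KANA_OFFSET = 0x60       # distance from a hiragana letter to its katakana twin

def fold_kana(text: str) -> str:
    """Return text with every hiragana letter replaced by katakana."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_FIRST <= ord(ch) <= HIRAGANA_LAST
        else ch
        for ch in text
    )

# Without folding, the two spellings of 'Toyota' do not match;
# with folding applied to both the indexed and the query term, they do.
indexed_term = "トヨタ"
query_term = "とよた"
matches_raw = indexed_term == query_term
matches_folded = fold_kana(indexed_term) == fold_kana(query_term)
```

The same folding would have to be applied at both index time and query time for the terms to line up.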

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread Walter Underwood

Why not index it as-is? Solr can handle Unicode. Transliterating hiragana to katakana is a very weird idea. I cannot imagine how that would help. You will need some sort of tokenization to find word boundaries. N-grams work OK for search, but are really ugly for highlighting. As far as I know,

Re: Multiple Japanese Alphabets in Solr

2011-03-11 Thread François Schiettecatte

Tomás, that won't really work. Transliteration to Romaji works for individual terms only, so you would need to tokenize the Japanese prior to transliteration. I am not sure what tool you plan to use for transliteration; I have used ICU in the past and from what I can tell it does not transliterate

7 matches
