You could certainly do it that way if you wanted.
The one point I would make here is that from a linguistic POV these are not
synonyms but are the same term written in a different alphabet.
François
On Mar 11, 2011, at 12:51 PM, Walter Underwood wrote:
> Sounds more like generating synonyms t
Tomás
The ICU code base is used a *lot*, so I think it is safe to say that it works
OK :)
François
On Mar 11, 2011, at 12:49 PM, Tomás Fernández Löbbe wrote:
> "the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ'
> or 'とよた' (Katakana and Hiragana respectively), not do
Sounds more like generating synonyms than conflating everything to one set of
kana.
Why not a filter that does that transliteration and adds a token at the same
position?
wunder
On Mar 11, 2011, at 9:49 AM, Tomás Fernández Löbbe wrote:
> "the issue has to do with recall, for example, I can wr
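A minimal sketch of the filter idea suggested above, written in plain Python rather than as an actual Lucene/Solr TokenFilter. The names `Token`, `hiragana_to_katakana`, and `add_kana_variants` are hypothetical and only for illustration. It assumes the token stream has already been split into words, and it relies on the Unicode layout in which each hiragana code point from U+3041 to U+3096 has its katakana counterpart exactly 0x60 higher. A position increment of 0 stands for "same position as the previous token", which is how stacked tokens such as synonyms are usually expressed.

```python
from typing import Iterable, Iterator, NamedTuple

# Unicode layout: hiragana U+3041..U+3096 maps to katakana U+30A1..U+30F6.
HIRAGANA_START = 0x3041
HIRAGANA_END = 0x3096
KANA_OFFSET = 0x60


class Token(NamedTuple):
    """Hypothetical token type: text plus a position increment."""
    text: str
    position_increment: int  # 0 = same position as the previous token


def hiragana_to_katakana(text: str) -> str:
    """Shift each hiragana character to katakana; leave everything else as-is."""
    return "".join(
        chr(ord(ch) + KANA_OFFSET)
        if HIRAGANA_START <= ord(ch) <= HIRAGANA_END
        else ch
        for ch in text
    )


def add_kana_variants(tokens: Iterable[Token]) -> Iterator[Token]:
    """Emit each token unchanged, followed by its katakana form (if it differs)
    stacked at the same position."""
    for token in tokens:
        yield token
        variant = hiragana_to_katakana(token.text)
        if variant != token.text:
            yield Token(variant, 0)


if __name__ == "__main__":
    stream = [Token("とよた", 1), Token("car", 1)]
    for tok in add_kana_variants(stream):
        print(tok.text, tok.position_increment)
```

With this approach the original token stays in the index and the variant is added next to it, so nothing is conflated away. It only covers the hiragana to katakana direction; Romaji or kanji readings would need a real transliteration library.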
"the issue has to do with recall, for example, I can write 'Toyota' as 'トヨタ'
or 'とよた' (Katakana and Hiragana respectively), not doing the transliteration
will miss results."
Exactly, that's my problem: searching in a different alphabet than the one
in which a document was indexed.
François, than
Good question about transliteration. The issue has to do with recall, for
example, I can write 'Toyota' as 'トヨタ' or 'とよた' (Katakana and Hiragana
respectively), not doing the transliteration will miss results. You will find
that the big search engines do the transliteration for you automatically.
Why not index it as-is? Solr can handle Unicode.
Transliterating hiragana to katakana is a very weird idea. I cannot imagine how
that would help.
You will need some sort of tokenization to find word boundaries. N-grams work
OK for search, but are really ugly for highlighting.
As far as I know,
Tomás
That won't really work. Transliteration to Romaji works for individual terms
only, so you would need to tokenize the Japanese prior to transliteration. I am
not sure what tool you plan to use for transliteration; I have used ICU in the
past and from what I can tell it does not transliterate