On 5/7/15, 11:23 AM, Kuntal Ganguly wrote:
1) Is this a correct approach to do it? Or am I missing something?
Does the user want to see documents that he/she doesn't understand?
The words such as "doctor", "taxi", etc. are common among many languages in 
Europe.
Would a Spanish user want to see English documents?
Of course, this issue can be worked around by having a separate language field.

How do you handle word collisions among languages?
"kind" in German means "child" in English. If a German user searches for
articles about children, they will find lots of unrelated English
articles about someone being kind.
This one, too, can be worked around by having a language field.
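
As a rough sketch of the language-field workaround, the query can carry a
filter on that field. The field names "text" and "language" and the helper
build_select_params are made up for illustration; only the q and fq
parameters are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_params(term, lang, text_field="text", lang_field="language"):
    """Build Solr select parameters that restrict hits to one language.

    The field names are assumptions; use whatever your schema defines.
    """
    return {
        "q": f"{text_field}:{term}",
        # fq is a filter query: it narrows the result set without
        # contributing to the relevancy score.
        "fq": f"{lang_field}:{lang}",
    }


# A German user searching for "kind" only gets German documents back.
params = build_select_params("kind", "de")
query_string = urlencode(params)
print(query_string)
```

This assumes each document's language was detected or supplied at index time.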

By default, Solr/Lucene hits are sorted by relevancy score, and
the score calculation uses IDF. If a search term appears in many documents,
its score is low. Because virtually all German documents contain "die", the
definite article, the score of the English word "die" will be low as well.
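
To put rough numbers on this, here is a small sketch using the classic
Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)). The document
counts are invented for illustration, not measured from any real index.

```python
import math


def classic_idf(doc_freq, num_docs):
    """IDF as computed by Lucene's classic TF-IDF similarity."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical corpus: 1000 German and 1000 English documents.
# Assume "die" occurs in every German document and in 20 English ones.
idf_mixed = classic_idf(doc_freq=1000 + 20, num_docs=2000)

# The same 20 English documents in an English-only index of 1000 documents.
idf_english_only = classic_idf(doc_freq=20, num_docs=1000)

print(f"shared field:  idf = {idf_mixed:.3f}")
print(f"English only:  idf = {idf_english_only:.3f}")
```

In the shared field the IDF of "die" is much lower, so an English query
for "die" carries far less weight than it would in an English-only field.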

2) Can you give me an example where there will be a problem with the above
new field type? A use case/scenario with an example would be very helpful.

If you have lots of Japanese documents indexed, try searching for "京都" (Kyoto).
You will find many documents about Tokyo (東京) because the government
of the metropolitan Tokyo area is written as "東京都" = Tokyo Capital, which
generates two bigrams, 東京 and 京都.
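
The overlap is easy to reproduce with a toy bigram splitter. This is only a
sketch of overlapping character bigrams, not the actual Lucene tokenizer;
cjk_bigrams is a made-up helper name.

```python
def cjk_bigrams(text):
    """Split a string into overlapping two-character tokens."""
    return [text[i:i + 2] for i in range(len(text) - 1)]


tokens = cjk_bigrams("東京都")
print(tokens)

# The query "京都" (Kyoto) is itself a single bigram, so it matches
# any document that contains "東京都".
query_tokens = cjk_bigrams("京都")
false_match = all(t in tokens for t in query_tokens)
print(false_match)
```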

Kuro


