On 5/7/15, 11:23 AM, Kuntal Ganguly wrote:
1) Is this the correct approach, or am I missing something?
Does the user want to see documents that he/she doesn't understand? Words such as "doctor", "taxi", etc. are common to many European languages. Would a Spanish user want to see English documents? Of course, this issue can be worked around by having a separate language field.
How do you handle word collisions among languages? "Kind" in German means "child" in English. If a German user searches for articles about children, they will find lots of unrelated English articles about someone being kind. This, too, can be worked around by having a language field. By default, Solr/Lucene hits are sorted by relevancy score, and the score calculation uses IDF: if a search term appears in many documents, its score is low. Because virtually all German documents contain "die" (the definite article), the IDF of that term is low, so a search for the English word "die" will score low as well.
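
To put rough numbers on the IDF point, here is a minimal sketch in Python. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)); the document counts are invented purely for illustration and are not from any real index.

```python
import math

def classic_idf(doc_freq, num_docs):
    """IDF as in Lucene's classic TF-IDF similarity (assumed here)."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical corpus: 500,000 German docs and 500,000 English docs.
# Assume "die" occurs in every German doc but in only 2,000 English docs.
german_docs = 500_000
english_docs = 500_000
english_docs_with_die = 2_000

# One shared field: the German article inflates the document frequency.
idf_mixed = classic_idf(german_docs + english_docs_with_die,
                        german_docs + english_docs)

# English documents indexed on their own: "die" is a fairly rare term.
idf_english_only = classic_idf(english_docs_with_die, english_docs)

print(f"mixed index:  {idf_mixed:.2f}")
print(f"English only: {idf_english_only:.2f}")
```

In the mixed index the term's IDF comes out far lower than in the English-only one, so the English verb contributes much less to the score.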
2) Can you give me an example where this new field type would cause a problem? A use case or scenario with an example would be very helpful.
If you have lots of Japanese documents indexed, try searching "京都" (Kyoto). You will find many documents about Tokyo (東京) because the government of the metropolitan Tokyo area is written as "東京都" = Tokyo Capital, which generates two bigrams, 東京 and 京都.

Kuro
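
The bigram overlap in the Kyoto example can be checked with a few lines of Python. This is only a sketch of overlapping character bigrams; `cjk_bigrams` is a hypothetical helper, not the actual Solr/Lucene tokenizer.

```python
def cjk_bigrams(text):
    """Return the overlapping two-character tokens of a string."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Tokens indexed for a document that mentions the Tokyo metropolitan government.
indexed_tokens = cjk_bigrams("東京都")

# Tokens produced for the query "Kyoto".
query_tokens = cjk_bigrams("京都")

# The query matches if every query token occurs among the indexed tokens.
matches = all(token in indexed_tokens for token in query_tokens)

print(indexed_tokens)
print(query_tokens)
print(matches)
```

Because the query's only bigram also occurs in the Tokyo document, the document is returned as a hit even though it never mentions Kyoto.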