Re: Language Identification and Stemming

Jan Høydahl Sat, 02 Mar 2013 14:07:18 -0800

In addition to the text_<lang> fields you can of course have a text_general
field which is unstemmed, where you put documents that you don't yet have
language-specific handling for.


One potential issue with multi-language search is detecting the language of the
query itself. Sometimes your search page knows in advance what language will be
entered, and then you can target the search at text_<lang> only. Other times you
won't know what language it is, and then you have a few choices:

a) Try to detect the language
b) Search across all languages (text_en OR text_fr OR ...)
c) Skip stemming and use only text_general
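
As a rough sketch of how the three choices could translate into request
parameters: the function names and the default language list below are
made up for illustration, and I use the edismax "qf" parameter as a
stand-in for spelling out text_en OR text_fr OR ... by hand.

```python
# Hypothetical helpers; only the field naming (text_<lang>, text_general)
# comes from this thread.
GENERAL_FIELD = "text_general"


def query_fields(language=None, indexed_languages=("en", "fr", "ru"),
                 search_all=True):
    """Pick the fields to search for a known, guessed or unknown language."""
    if language in indexed_languages:
        # Language known or detected (choice a): use its stemmed field only.
        return ["text_" + language]
    if search_all:
        # Language unknown (choice b): search every language field.
        return ["text_" + lang for lang in indexed_languages]
    # Choice c: skip stemming and use the unstemmed catch-all field.
    return [GENERAL_FIELD]


def build_params(user_query, **kwargs):
    """Build request parameters, assuming the edismax query parser is used."""
    fields = query_fields(**kwargs)
    return {"defType": "edismax", "q": user_query, "qf": " ".join(fields)}
```

For example, build_params("maison", language="fr") searches text_fr only,
while build_params("maison") falls back to all language fields.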

Detecting the language of a short one- or two-word query is hard. You will be
able to distinguish Chinese from Japanese from Western languages based on unique
characters, but it is much harder to tell Western languages apart.
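
To illustrate the "unique characters" point, here is a minimal sketch that
only looks at Unicode ranges. The function name and return values are my
own. It assumes that kana means Japanese and that a query with Han
characters but no kana is probably Chinese, which is a guess since a short
Japanese query can consist of kanji only.

```python
def guess_cjk_language(query):
    """Guess 'ja' or 'zh' from the characters used, or None if undecided."""
    has_han = False
    for ch in query:
        code_point = ord(ch)
        if 0x3040 <= code_point <= 0x30FF:
            # Hiragana (U+3040-309F) or Katakana (U+30A0-30FF).
            return "ja"
        if 0x4E00 <= code_point <= 0x9FFF:
            # CJK Unified Ideographs, shared by Chinese and Japanese.
            has_han = True
    # Han without kana: assume Chinese. No CJK at all: cannot tell.
    return "zh" if has_han else None
```

A query written in Latin script returns None here, which is exactly the
case where you need one of the other strategies.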

Searching across all languages works great, but you may get some false
positives, e.g. from stemming, when the same word exists with different meanings
in several languages. Besides, if you have 200 languages in your index, it is
impractical to search across 200 fields.

If you skip stemming you will in many cases still be able to build a great
search, but you may be better off trying to guess the input language by means of
IP detection, browser headers, statistical analysis or simply asking the user.
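
For the browser header option, something along these lines could work. It
is a simplified sketch that reads the HTTP Accept-Language header; the
function name and the list of supported languages are assumptions, and it
ignores wildcards and region subtags beyond the primary language.

```python
def language_from_accept_header(header, supported=("en", "fr", "ru")):
    """Return the best supported primary language from Accept-Language."""
    candidates = []
    for position, part in enumerate(header.split(",")):
        pieces = part.strip().split(";")
        tag = pieces[0].strip().lower()
        if not tag:
            continue
        quality = 1.0  # A tag without an explicit q value counts as 1.0.
        for param in pieces[1:]:
            name, _, value = param.strip().partition("=")
            if name.strip().lower() == "q":
                try:
                    quality = float(value)
                except ValueError:
                    quality = 0.0
        primary = tag.split("-")[0]  # "fr-CH" -> "fr"
        if primary in supported and quality > 0:
            # Sort key: highest quality first, then earliest in the header.
            candidates.append((-quality, position, primary))
    return min(candidates)[2] if candidates else None
```

If this returns None you are back to searching all languages or
text_general. The browser language is only a hint about the user, not
about the query, so treat the result as a guess.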

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 1 March 2013 at 23:47, vybe3142 <vybe3...@gmail.com> wrote:

> From your response, I gather that there's no way to maintain a single set of
> fields for multiple languages i.e. I can't use a field "text" for the body
> text. Instead, I would have to define text_en, text_fr, text_ru etc each
> mapped to their specific languages.
> 
> 
> 
> --
> View this message in context: 
> http://lucene.472066.n3.nabble.com/Language-Identification-and-Stemming-tp4044116p4044132.html
> Sent from the Solr - User mailing list archive at Nabble.com.
