In addition to the text_<lang> fields you can of course have an unstemmed text_general field, where you put documents that you don't yet have language-specific handling for.
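As a rough sketch of that fallback on the indexing side (the set of supported languages and the helper name target_field are assumptions for illustration, not anything Solr provides):

```python
# Languages that have a dedicated text_<lang> field in the schema.
# This particular set is an assumption for the example.
SUPPORTED_LANGS = {"en", "fr", "ru"}


def target_field(lang):
    """Return the field name a document body should be indexed into."""
    if lang in SUPPORTED_LANGS:
        return "text_" + lang
    # No language-specific analysis yet (or language unknown):
    # fall back to the unstemmed catch-all field.
    return "text_general"
```

A document identified as French would go to text_fr, while one in a language you have not configured yet ends up in text_general.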
One potential issue with multi-language search is detecting the language of the query itself. Sometimes your search page knows in advance what language will be input; then you can target the search at text_<lang> only. Other times you won't know what language it is, and then you have a few choices:

a) Try to detect the language
b) Search across all languages (text_en OR text_fr OR ...)
c) Skip stemming and use only text_general

Detecting the language of a short one- or two-word query is hard. You will be able to distinguish Chinese from Japanese from Western languages based on unique characters, but it is much harder to tell Western languages apart.

Searching across all languages works great, but you may get some false positives, e.g. from stemming, when the same word exists with different meanings in several languages. Besides, if you have 200 languages in your index, it is impractical to search across 200 fields.

If you skip stemming you will in many cases still be able to build a great search, but you may be better off trying to guess the input language by means of IP detection, browser headers, statistical analysis or simply asking the user.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 1 March 2013 at 23:47, vybe3142 <vybe3...@gmail.com> wrote:

> From your response, I gather that there's no way to maintain a single set of
> fields for multiple languages i.e. I can't use a field "text" for the body
> text. Instead, I would have to define text_en, text_fr, text_ru etc each
> mapped to their specific languages.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Language-Identification-and-Stemming-tp4044116p4044132.html
> Sent from the Solr - User mailing list archive at Nabble.com.
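For reference, the three query-side choices a) to c) from the reply can be sketched as a small query builder. This is only an illustration: the function name build_query, its parameters and the default language list are assumptions, and the user input is assumed to be already escaped for the Lucene query syntax.

```python
def build_query(user_query, lang=None, all_langs=("en", "fr", "ru"), stem=True):
    """Build a query string over the appropriate text fields.

    user_query is assumed to be already escaped for the query parser.
    """
    if not stem:
        # c) skip stemming and search only the unstemmed field
        fields = ["text_general"]
    elif lang is not None:
        # a) language is known or was detected: target one field
        fields = ["text_" + lang]
    else:
        # b) language unknown: search across every language field
        fields = ["text_" + code for code in all_langs]
    return " OR ".join("%s:(%s)" % (field, user_query) for field in fields)
```

With no language given, build_query("chat") searches text_en, text_fr and text_ru together, which is where the false positives mentioned in the reply can come from; passing lang="fr" narrows it to text_fr only.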