You may be interested in a recent discussion that took place on a similar subject: http://www.mail-archive.com/solr-user@lucene.apache.org/msg09332.html
Nicolas

-----Original Message-----
From: David King [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, 19 March 2008 20:07
To: solr-user@lucene.apache.org
Subject: Language support

This has probably been asked before, but I'm having trouble finding it. Basically, we want to be able to search for content across several languages, given that we know what language a datum and a query are in. Is there an obvious way to do this?

Here's the longer version: I am trying to index content that occurs in multiple languages, including Asian languages. I'm in the process of moving from PyLucene to Solr. In PyLucene, I would have a list of analysers:

    analyzers = dict(en = pyluc.SnowballAnalyzer("English"),
                     cs = pyluc.CzechAnalyzer(),
                     pt = pyluc.SnowballAnalyzer("Portuguese"),
                     ...

Then when I want to index something, I do

    writer = pyluc.IndexWriter(store, analyzer, create)
    writer.addDocument(d.doc)

That is, I tell Lucene the language of every datum, and the analyser to use when writing out the field. Then when I want to search against it, I do

    analyzer = LanguageAnalyzer.getanal(lang)
    q = pyluc.QueryParser(field, analyzer).parse(value)

And use that QueryParser to parse the query in the given language before sending it off to PyLucene. (off-topic: getanal() is perhaps my favourite function-name ever). So the language of a given datum is attached to the datum itself. In Solr, however, this appears to be attached to the field, not to the individual data in it:

    <fieldType name="text_greek" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
    </fieldType>

Does this mean that there's no way to have a single "contents" field that has content in multiple languages, and still have the queries be parsed and stemmed correctly? How are other people handling this? Does it make sense to write a tokeniser factory and a query factory that look at, say, the 'lang' field and return the correct tokenisers? Does this already exist?

The other alternative is to have a text_zh field, a text_en field, etc., and to modify the query to search the field that matches the language of the query, but that seems kind of hacky to me, especially if a query may be against more than one language. Is this the accepted way to go about it? Is there a benefit to this method over writing a detecting tokeniser factory?
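
For illustration, here is a rough Python sketch of the per-language-field alternative described above, under a few assumptions: one field per language named text_<lang> (as in text_en and text_zh), and a query string built on the client before it is sent to Solr. The helper names (escape_query, doc_for, build_query) are hypothetical and not part of Solr or PyLucene, and the escaping covers only the usual Lucene query-syntax characters.

```python
import re

# Assumed naming convention: one Solr field per language, e.g. text_en, text_zh.
FIELD_PREFIX = "text_"

# Characters the Lucene query syntax treats specially.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|])')


def escape_query(text):
    """Backslash-escape query-syntax characters in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)


def doc_for(doc_id, lang, text):
    """Build a document dict that stores `text` in the field for `lang`."""
    return {"id": doc_id, "lang": lang, FIELD_PREFIX + lang: text}


def build_query(text, langs):
    """Build a query string that searches the field of each language in `langs`."""
    escaped = escape_query(text)
    clauses = ["%s%s:(%s)" % (FIELD_PREFIX, lang, escaped) for lang in langs]
    return " OR ".join(clauses)


if __name__ == "__main__":
    print(doc_for("1", "en", "hello world"))
    print(build_query("hello world", ["en", "zh"]))
```

Because each text_<lang> field would be declared with its own fieldType, Solr applies the matching analyser at both index and query time, and a query against more than one language becomes an OR over the corresponding fields.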