Re: Question regarding indexing multiple languages, stopwords, etc.

Otis Gospodnetic Mon, 21 Feb 2011 20:51:50 -0800

Greg,

You need to get stopword lists for your 6 languages.  Then you need to create
new field types just like that 'text' type, one for each language.  Point each
one to the appropriate stopwords file and set the stemmer's language to that
language instead of "English".  You can either index each language in its own
index or put them all in the same index, in which case you'll want fields like
title_en, title_fr, etc.
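
A rough sketch of what two of those per-language types might look like,
modeled on the 'text' type quoted below.  The type names (text_fr, text_de),
the stopword file names (stopwords_fr.txt, stopwords_de.txt) and the title_*
field attributes are placeholders made up for illustration; use whatever
matches the files you end up with.  Note that Snowball has no Chinese stemmer
and whitespace tokenization will not split Chinese text into words, so the
Chinese type will most likely need a different analyzer chain (see the
LanguageAnalysis wiki page) rather than a copy of this one.

```xml
<!-- Sketch only: type names and stopword file names are hypothetical.
     The word delimiter and synonym filters from the stock 'text' type
     are left out here to keep the example short. -->

<!-- These go inside <types> in schema.xml -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- French stopword list instead of the default stopwords.txt -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_fr.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "French" replaces "English" as the Snowball language -->
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords_de.txt" enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
  </analyzer>
</fieldType>

<!-- These go inside <fields>: one field per language in a shared index -->
<field name="title_en" type="text"    indexed="true" stored="true"/>
<field name="title_fr" type="text_fr" indexed="true" stored="true"/>
<field name="title_de" type="text_de" indexed="true" stored="true"/>
```

With a single <analyzer> element (no type attribute) the same chain is used at
both index and query time, which keeps stopword removal and stemming
consistent between the two.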


Check http://search-lucene.com/ - this multilingual stuff is a common topic.

Otis
----
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/



----- Original Message ----
> From: Greg Georges <greg.geor...@biztree.com>
> To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
> Sent: Mon, February 21, 2011 4:27:46 PM
> Subject: Question regarding indexing multiple languages, stopwords, etc.
> 
> Hello all,
> 
> I have gotten my DataImportHandler to index my data from my MySQL database. I
> was looking at the schema tool and noticed that stopwords in different
> languages are being indexed as terms. The 6 languages we have are English,
> French, Spanish, Chinese, German and Italian.
> 
> Right now I am using the basic schema configuration for English. How do I
> define it for the other languages? I have looked at the wiki page
> (http://wiki.apache.org/solr/LanguageAnalysis) but I would like to have an
> example configuration for all the languages I need. Also, I need a list of
> stopwords for these languages. So far I have this:
> 
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>   <analyzer type="index">
>     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>     <!-- in this example, we will only use synonyms at query time
>     <filter class="solr.SynonymFilterFactory"
>             synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
>     -->
> 
>     <filter class="solr.StopFilterFactory"
>             ignoreCase="true"
>             words="stopwords.txt"
>             enablePositionIncrements="true"/>
>     <filter class="solr.WordDelimiterFilterFactory"
>             generateWordParts="1" generateNumberParts="1"
>             catenateWords="1" catenateNumbers="1"
>             catenateAll="0" splitOnCaseChange="1"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.SnowballPorterFilterFactory" language="English"
>             protected="protwords.txt"/>
>   </analyzer>
> 
> Thanks in advance
> 
> Greg
>
