Re: schema.xml for CJK, German, French, etc.

Erik Hatcher Wed, 02 Jul 2008 18:41:06 -0700


On Jul 2, 2008, at 9:16 PM, George Aroush wrote:

Has anyone created schema.xml for languages other than English?


Indeed.

I'd like to see a working example, mainly for CJK, German and French. If you have any, can you share them?

To get me started, I created the following for German:

 <fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>

Will those filters work on German text?

One tip that will help is visiting http://localhost:8983/solr/admin/analysis.jsp and testing it out to see that you're getting the tokenization that you desire on some sample text. Solr's analysis introspection is quite nice and easy to tinker with.

Removing stop words before lowercasing won't quite work, though, as StopFilter is case-sensitive with all stop words generally lowercased, but other than relocating the StopFilterFactory in that chain it seems reasonable.
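
For illustration, here is a sketch of the same chain with only the StopFilterFactory relocated so that it runs after lowercasing. Placing it directly after LowerCaseFilterFactory (and before the stemmer) is my assumption about a sensible spot, and it also assumes stopwords.txt holds lowercased German stop words. Every other filter and attribute is unchanged from the original.

```xml
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Moved here so stop words are matched against lowercased tokens.
         Assumes stopwords.txt contains lowercased German stop words. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

Running some sample German text through the analysis page with this field type should show whether the stop words are actually being dropped.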

As always, though, more concrete recommendations depend on what you want to do with these languages.


        Erik
