On Jul 2, 2008, at 9:16 PM, George Aroush wrote:
Has anyone created schema.xml for languages other than English?

Indeed.

I'd like to see a working example, mainly for CJK, German and French.
If you have any, can you share them?

To get me started, I created the following for German:

 <fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>

Will those filters work on German text?


One tip that will help: visit http://localhost:8983/solr/admin/analysis.jsp and run some sample text through it to see whether you're getting the tokenization you want. Solr's analysis introspection is quite nice and easy to tinker with.

Removing stop words before lowercasing won't quite work, though: StopFilter is case-sensitive, and stop word lists are generally all lowercase. Other than relocating the StopFilterFactory in that chain so that it runs after LowerCaseFilterFactory, it seems reasonable.
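
A minimal sketch of that reordering, with the stop filter moved to just after the lowercase filter and everything else left as posted. It assumes stopwords.txt holds lowercase German stop words, and it is untested, so treat it as a starting point to try in analysis.jsp rather than a verified schema:

 <fieldtype name="myfieldtype" class="solr.TextField">
   <analyzer>
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
             generateNumberParts="1" catenateWords="1" catenateNumbers="1"
             catenateAll="0"/>
     <!-- lowercase first, so the lowercase entries in stopwords.txt can match -->
     <filter class="solr.LowerCaseFilterFactory"/>
     <!-- assumption: stopwords.txt contains German stop words, all lowercase -->
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="stopwords.txt"/>
     <filter class="solr.SnowballPorterFilterFactory" language="German"/>
     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
   </analyzer>
 </fieldtype>

The word delimiter filter stays ahead of the lowercase filter here, as in the original chain, so it still sees the original casing.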

As always, though, more concrete recommendations depend on what you want to do with these languages.

        Erik
