On Jul 2, 2008, at 9:16 PM, George Aroush wrote:
Has anyone created schema.xml for languages other than English?
Indeed.
I'd like to see a working example, mainly for CJK, German and French.
If you have any, can you share them?
To get me started, I created the following for German:
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
Will those filters work on German text?
One tip that will help is visiting http://localhost:8983/solr/admin/analysis.jsp
and testing it out on some sample text to see that you're getting the
tokenization you want. Solr's analysis introspection is quite nice and
easy to tinker with.
Removing stop words before lowercasing won't quite work though, as
StopFilter is case-sensitive by default and stop word lists are generally
lowercased, but other than relocating the StopFilterFactory in that
chain it seems reasonable.
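
For illustration, here is a rough sketch of the same chain with the
StopFilterFactory moved after the LowerCaseFilterFactory. It is only one
possible ordering, not a tested configuration. It assumes stopwords.txt
holds lowercase German stop words (the stock file may well be English),
and it keeps every other attribute from the original unchanged.

```xml
<fieldtype name="myfieldtype" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="0"
            generateNumberParts="1" catenateWords="1" catenateNumbers="1"
            catenateAll="0"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Moved here so tokens are already lowercase when checked against the list. -->
    <!-- Assumption: stopwords.txt contains lowercase German stop words. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldtype>
```

With ignoreCase="true" the stop filter should already match regardless
of case, so the ordering matters most if that attribute is ever dropped.
Stop word removal stays ahead of the stemmer so the list is compared
against unstemmed tokens.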
As always, though, more concrete recommendations depend on what you
want to do with these languages.
Erik