Re: how to Index and Search non-Eglish Text in solr

Erick Erickson Wed, 08 Jun 2011 06:18:08 -0700

This page is a handy reference for individual languages...
http://wiki.apache.org/solr/LanguageAnalysis


But the usual approach, especially for Chinese/Japanese/Korean
(CJK), is to index the content in separate fields with language-specific
analyzers, then spread your search across those language-specific
fields (e.g. title_en, title_fr, title_ar). Stemming and stopwords
in particular give "surprising" results if you put words from different
languages in the same field, because one language's stemmer and stopword
list then get applied to every other language's words as well.
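
As a rough sketch of what that could look like in schema.xml (not a
tested configuration): the field names news_en/news_fr/news_cjk, the type
names text_fr/text_cjk and the file stopwords_fr.txt are made up for
illustration, and "text" is assumed to be the English type from the
question below. It also assumes a Solr version that still ships
solr.CJKTokenizerFactory, as 1.4 does.

```xml
<!-- Sketch only: field names, type names and stopwords_fr.txt are hypothetical. -->

<!-- Inside <fields>: one field per language, each bound to its own analysis chain. -->
<field name="news_en"  type="text"     indexed="true" stored="false"/>
<field name="news_fr"  type="text_fr"  indexed="true" stored="false"/>
<field name="news_cjk" type="text_cjk" indexed="true" stored="false"/>

<!-- Inside <types>: French chain with French stopwords and stemming.
     stopwords_fr.txt is an assumed file that you would have to supply. -->
<fieldType name="text_fr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French"/>
  </analyzer>
</fieldType>

<!-- Inside <types>: CJK text has no whitespace between words, so this chain
     indexes overlapping character bigrams and applies no stemming. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.CJKTokenizerFactory"/>
  </analyzer>
</fieldType>
```

With something like this in place, the indexing client has to decide
which field each article goes into, and a query can be spread across the
fields, for example with the dismax parser and qf=news_en news_fr news_cjk.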

Best
Erick

On Wed, Jun 8, 2011 at 8:34 AM, Mohammad Shariq <shariqn...@gmail.com> wrote:
> Hi,
> I have set up Solr (solr-1.4 on Ubuntu 10.10) for indexing news articles in
> English, but my requirements now extend to indexing news in other languages too.
>
> This is how my schema looks:
> <field name="news" type="text" indexed="true" stored="false"
>        required="false"/>
>
>
> And the "text" field type in schema.xml looks like:
>
> <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
>    <analyzer type="index">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>               words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>               generateNumberParts="1" catenateWords="1" catenateNumbers="1"
>               catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>               protected="protwords.txt"/>
>    </analyzer>
>    <analyzer type="query">
>       <tokenizer class="solr.WhitespaceTokenizerFactory"/>
>       <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
>               ignoreCase="true" expand="true"/>
>       <filter class="solr.StopFilterFactory" ignoreCase="true"
>               words="stopwords.txt" enablePositionIncrements="true"/>
>       <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
>               generateNumberParts="1" catenateWords="0" catenateNumbers="0"
>               catenateAll="0" splitOnCaseChange="1"/>
>       <filter class="solr.LowerCaseFilterFactory"/>
>       <filter class="solr.SnowballPorterFilterFactory" language="English"
>               protected="protwords.txt"/>
>    </analyzer>
> </fieldType>
>
>
> My problem is:
> Now I want to index news articles in other languages too, e.g.
> Chinese and Japanese.
> How can I modify my text field so that I can index news in other languages
> too and make it searchable?
>
> Thanks
> Shariq
