Re: Korean Tokenizer in solr

Poornima Jay Sun, 13 Jul 2014 23:59:25 -0700

I have upgraded Solr to version 4.8.1, but after making changes in the 
schema file I am getting the error below:
Error instantiating class: 
'org.apache.lucene.analysis.cjk.CJKBigramFilterFactory'
I assume CJKBigramFilterFactory and CJKFoldingFilterFactory are supported in 
4.8.1. Do I need to make any configuration changes to get this working?


Please advise.

Regards,
Poornima
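
For comparison, a bigram field type built only from factories in the stock Solr 4.x analysis module might look like the sketch below. Whether this matches the intended setup is an assumption, and the type name text_cjk_stock is made up for illustration. Note that the CJKFoldingFilterFactory in the quoted schema lives in the edu.stanford.lucene.analysis package, so it would need its own jar, and the ICU factories generally need the analysis-extras contrib jars on the classpath.

```xml
<!-- Sketch only: every factory here ships in the core analysis jar,
     so no extra <lib> entries should be needed. -->
<fieldType name="text_cjk_stock" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- han/hiragana/katakana/hangul choose which scripts are bigrammed;
         outputUnigrams="true" would also emit single characters. -->
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
            katakana="true" hangul="true" outputUnigrams="false"/>
  </analyzer>
</fieldType>
```

If a type like this loads cleanly, the remaining filters can be added back one at a time to see which one triggers the instantiation error.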


On Thursday, 10 July 2014 2:45 PM, Alexandre Rafalovitch wrote:
 


I would suggest you read through all 12 (?) articles in this series:
http://discovery-grindstone.blogspot.com/2013/10/cjk-with-solr-for-libraries-part-1.html
It will probably lay out most of the issues for you.

And if you are just starting, I would really suggest using the latest Solr
(4.9). A lot more people remember what the latest version has than
what was in 3.6. And, as the series above will tell you, some relevant
issues have been fixed in more recent Solr versions.

Regards,
   Alex.
Personal website: http://www.outerthoughts.com/
Current project: http://www.solr-start.com/ - Accelerating your Solr proficiency



On Thu, Jul 10, 2014 at 4:11 PM, Poornima Jay wrote:
> Until now I thought Solr supported KoreanTokenizer. I haven't used any 
> other third-party one.
> Actually, the issue I am facing is that I need to integrate English, Chinese, 
> Japanese and Korean language search in a single site. Based on the language 
> the user selects for the search, the appropriate fields will be queried.
>
> I tried using CJK for all three languages as below, but only a few search 
> terms work for Chinese and Japanese. Nothing works for Korean.
>
> <fieldtype name="text_cjk" class="solr.TextField" positionIncrementGap="10000" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.CJKTokenizerFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="edu.stanford.lucene.analysis.CJKFoldingFilterFactory"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
>     <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true" katakana="true" hangul="true" outputUnigrams="true"/>
>   </analyzer>
> </fieldtype>
>
> So I tried to implement an individual fieldType for each language, as below:
>
> Chinese
> <fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.ICUTokenizerFactory"/>
>     <filter class="solr.ICUFoldingFilterFactory"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.CJKBigramFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Japanese
> <fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
>   <analyzer>
>     <tokenizer class="solr.JapaneseTokenizerFactory" mode="search"/>
>     <filter class="solr.JapaneseBaseFormFilterFactory"/>
>     <filter class="solr.JapanesePartOfSpeechStopFilterFactory" tags="stoptags_ja.txt"/>
>     <filter class="solr.CJKWidthFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_ja.txt"/>
>     <filter class="solr.JapaneseKatakanaStemFilterFactory" minimumLength="4"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>   </analyzer>
> </fieldType>
>
> Korean
> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000" autoGeneratePhraseQueries="false">
>   <analyzer type="index">
>     <tokenizer class="solr.KoreanTokenizerFactory"/>
>     <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>   </analyzer>
>   <analyzer type="query">
>     <tokenizer class="solr.KoreanTokenizerFactory"/>
>     <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
>     <filter class="solr.LowerCaseFilterFactory"/>
>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>   </analyzer>
> </fieldType>
>
> I am really stuck on how to implement this. Please help me.
>
> Thanks,
> Poornima
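
The per-language routing described in this message could be sketched in schema.xml roughly as follows. Only product_name_kr appears in the thread (in the error message further down); the other field names and the text_en type are assumptions for illustration, and text_kr only loads once a Korean analyzer jar is actually on the classpath.

```xml
<!-- Hypothetical layout: one field per language, each bound to its own
     analysis chain. Field names other than product_name_kr are made up. -->
<field name="product_name_en" type="text_en"  indexed="true" stored="true"/>
<field name="product_name_zh" type="text_cjk" indexed="true" stored="true"/>
<field name="product_name_ja" type="text_ja"  indexed="true" stored="true"/>
<field name="product_name_kr" type="text_kr"  indexed="true" stored="true"/>
```

A search for a user who selected Korean would then be restricted to the matching field, for example q=product_name_kr:검색어.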
>
>
>
> On Thursday, 10 July 2014 2:22 PM, Alexandre Rafalovitch wrote:
>
>
>
> I don't think Solr ships with a Korean tokenizer, does it?
>
> If you are using a third-party one, you need to give the full class name,
> not just solr.Korean... And you need the library added in the lib
> statement in solrconfig.xml (at least in Solr 4).
>
> Regards,
>    Alex.
> Personal website: http://www.outerthoughts.com/
> Current project: http://www.solr-start.com/ - Accelerating your Solr 
> proficiency
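
The two steps in this reply (a lib statement plus fully qualified class names) might look like the sketch below. The jar location, the regex, and the com.example.analysis.ko package are placeholders, not the names of any real Korean analyzer; the actual values depend on the library chosen.

```xml
<!-- solrconfig.xml: load the third-party analyzer jar.
     The directory and regex are placeholders. -->
<lib dir="./lib" regex=".*korean.*\.jar"/>

<!-- schema.xml: reference third-party factories by fully qualified class
     name. com.example.analysis.ko is a made-up package; the solr. shorthand
     is kept only for factories that ship with Solr itself. -->
<fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
  <analyzer>
    <tokenizer class="com.example.analysis.ko.KoreanTokenizerFactory"/>
    <filter class="com.example.analysis.ko.KoreanFilterFactory"
            hasOrigin="true" hasCNoun="true" bigrammable="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```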
>
>
>
> On Thu, Jul 10, 2014 at 3:23 PM, Poornima Jay wrote:
>> I have defined the fieldType inside the fields section. When I checked the 
>> error log, I found the errors below:
>>
>> Caused by: java.lang.ClassNotFoundException: solr.KoreanTokenizerFactory
>>
>> SEVERE: org.apache.solr.common.SolrException: analyzer without class or 
>> tokenizer & filter list
>>
>>
>> Do I need to add any libraries for KoreanTokenizer?
>>
>> Regards,
>> Poornima
>>
>>
>> On Thursday, 10 July 2014 1:03 PM, Alexandre Rafalovitch wrote:
>>
>>
>>
>> Double-check your XML file to make sure that you don't, for example, define
>> your fieldType outside of the fields section. Or maybe you have an earlier
>> exception about some component in the type definition.
>>
>> This is not about the Korean language, it seems, but something more
>> fundamental about the XML config.
>>
>> Regards,
>>    Alex.
>> Personal website: http://www.outerthoughts.com/
>> Current project: http://www.solr-start.com/ - Accelerating your Solr 
>> proficiency
>>
>>
>>
>> On Thu, Jul 10, 2014 at 2:26 PM, Poornima Jay wrote:
>>> Hi,
>>>
>>> Has anyone tried to implement Korean language support in Solr 3.6.1? I
>>> defined the field type as below in my schema file, but it is not working.
>>>
>>> <fieldType name="text_kr" class="solr.TextField" positionIncrementGap="1000">
>>>   <analyzer type="index">
>>>     <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>     <filter class="solr.KoreanFilterFactory" hasOrigin="true" hasCNoun="true" bigrammable="true"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>   </analyzer>
>>>   <analyzer type="query">
>>>     <tokenizer class="solr.KoreanTokenizerFactory"/>
>>>     <filter class="solr.KoreanFilterFactory" hasOrigin="false" hasCNoun="false" bigrammable="false"/>
>>>     <filter class="solr.LowerCaseFilterFactory"/>
>>>     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_kr.txt"/>
>>>   </analyzer>
>>> </fieldType>
>>>
>>> Error : Caused by: org.apache.solr.common.SolrException: Unknown fieldtype
>>> 'text_kr' specified on field product_name_kr
>>>
>>> Regards,
>>> Poornima
>>>
