Here's my configuration in schema.xml for the JiebaTokenizerFactory.
<fieldType name="text_chinese2" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="analyzer.solr5.jieba.JiebaTokenizerFactory" segMode="SEARCH"/>
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

<field name="content" type="text_chinese2" indexed="true" stored="true" omitNorms="true" termVectors="true"/>

Could there be anything in this configuration that is causing the problem with English words?

Regards,
Edwin

On 29 October 2015 at 17:51, Zheng Lin Edwin Yeo <edwinye...@gmail.com> wrote:

> I would like to check: is it possible to use JiebaTokenizerFactory to
> index multilingual documents in Solr?
>
> I found that JiebaTokenizerFactory works better for Chinese characters
> than HMMChineseTokenizerFactory.
>
> However, for English words, JiebaTokenizerFactory cuts the words in the
> wrong places. For example, it splits the word "water" as follows:
> *w|at|er*
>
> This means that Solr will search for the three separate tokens "w", "at"
> and "er" instead of the whole word "water".
>
> Is there any way to solve this problem, besides using a separate field
> for English and Chinese text?
>
> Regards,
> Edwin
>
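
P.S. In case it is useful for anyone looking at this, below is a rough, untested sketch of how I check the per-stage token output of this chain through Solr's field analysis handler (the same information the Analysis screen in the admin UI shows). The host, port and core name "collection1" are only placeholders and would need to be adjusted to your setup.

# Rough sketch (untested): ask Solr's /analysis/field handler how the
# text_chinese2 chain tokenizes a mixed English/Chinese string.
# Host, port and core name ("collection1") are placeholders; adjust as needed.
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    "analysis.fieldtype": "text_chinese2",  # field type defined in schema.xml
    "analysis.fieldvalue": "water 我的水",   # sample index-time text
    "analysis.query": "water",              # sample query-time text
    "wt": "json",
})
url = "http://localhost:8983/solr/collection1/analysis/field?" + params

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

# Pretty-print the per-stage token output, which shows at which stage
# (Jieba tokenizer, CJKBigramFilter, EdgeNGramFilter, ...) "water" gets split.
print(json.dumps(result["analysis"], ensure_ascii=False, indent=2))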