I've tried using solr.HMMChineseTokenizerFactory with the following configuration:

<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

The documents index without problems, but when I search, the query matches many other words in addition to the ones I searched for. Why is this so?

For example, the query

http://localhost:8983/edm/collection3/highlight?q=我国

returns the following, with words other than 我国 highlighted:

"title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],

Regards,
Edwin

On 10 June 2015 at 14:40, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> You may find the series of articles on CJK analysis/search helpful:
> http://discovery-grindstone.blogspot.com.au/
>
> It's a little out of date, but should be a very solid intro.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 10 June 2015 at 16:35, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
> > Hi,
> >
> > I'm trying to index rich-text documents that are in Chinese. Currently,
> > there's no problem with indexing, but there's a problem with searching.
> >
> > Does anyone know which Tokenizer and Filter Factory are best to use?
> > I'm now using solr.StandardTokenizerFactory, which I've heard is not
> > very good for Chinese. Is that true?
> >
> >
> > Regards,
> > Edwin
>