Re: Indexing documents in Chinese

Zheng Lin Edwin Yeo Wed, 10 Jun 2015 02:23:34 -0700

I've tried using solr.HMMChineseTokenizerFactory with the following
configuration:


<fieldType name="text_chinese" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.HMMChineseTokenizerFactory"/>
    <filter class="solr.StopFilterFactory"
            words="org/apache/lucene/analysis/cn/smart/stopwords.txt"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

The documents index without errors, but when I search for a word, the
results match many other words as well, not just the word I searched for.
Why is this so?

For example, the query
http://localhost:8983/edm/collection3/highlight?q=我国

actually matches

"title":["<em>我国</em>1<em>月份</em>的制造业<em>产值</em><em>同比</em>仅<em>增长</em>0"],

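One way to narrow this down is to send the query with the Chinese text percent-encoded and to compare how the field type tokenizes the indexed text and the query text. The sketch below only builds the two URLs. It assumes that Solr's field analysis handler is reachable at /analysis/field under the same base path as the highlight handler above, which may not hold for this setup; the helper names are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the query above. The /analysis/field path below is an
# assumption about how this Solr instance is configured.
BASE = "http://localhost:8983/edm/collection3"


def highlight_url(query):
    """Build the highlight request with the query text percent-encoded as UTF-8."""
    return BASE + "/highlight?" + urlencode({"q": query})


def analysis_url(field_type, index_text, query_text):
    """Build a field analysis request that shows the tokens produced for
    index_text and query_text by the given field type."""
    params = {
        "analysis.fieldtype": field_type,
        "analysis.fieldvalue": index_text,
        "analysis.query": query_text,
        "analysis.showmatch": "true",
        "wt": "json",
    }
    return BASE + "/analysis/field?" + urlencode(params)


if __name__ == "__main__":
    print(highlight_url("我国"))
    print(analysis_url("text_chinese", "我国1月份的制造业产值同比仅增长0", "我国"))
```

Opening the second URL in a browser, or using the Analysis screen in the Solr admin UI, should list the tokens each filter emits, which makes it easier to see whether the extra highlighted words come from the analyzer or from how the query is parsed.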

Regards,
Edwin



On 10 June 2015 at 14:40, Alexandre Rafalovitch <arafa...@gmail.com> wrote:

> You may find this series of articles on CJK analysis/search helpful:
> http://discovery-grindstone.blogspot.com.au/
>
> It's a little out of date, but should be a very solid intro.
>
> Regards,
>    Alex.
> ----
> Solr Analyzers, Tokenizers, Filters, URPs and even a newsletter:
> http://www.solr-start.com/
>
>
> On 10 June 2015 at 16:35, Zheng Lin Edwin Yeo <edwinye...@gmail.com>
> wrote:
> > Hi,
> >
> > I'm trying to index rich-text documents that are in Chinese. Currently,
> > there's no problem with indexing, but there is a problem with searching.
> >
> > Does anyone know which Tokenizer and Filter Factory are best to use? I'm
> > now using solr.StandardTokenizerFactory, which I have heard is not very
> > good for Chinese. Is that true?
> >
> >
> > Regards,
> > Edwin
>
