For Chinese search, you may also consider
org.apache.lucene.analysis.cn.ChineseTokenizer.

The ChineseTokenizer Javadoc describes it as extracting tokens from
the stream using Character.getType(), with the rule that each Chinese
character becomes a single token. The difference between the
ChineseTokenizer and the CJKTokenizer (id=23545) is that they have
different token parsing logic. To use an example: if a Chinese text
"C1C2C3C4" is indexed, the tokens returned from the ChineseTokenizer
are C1, C2, C3, C4, while the tokens returned from the CJKTokenizer
are C1C2, C2C3, C3C4. Therefore the index the CJKTokenizer creates is
much larger. The consequence is that when searching for C1, C1C2,
C1C3, C4C2, C1C2C3 ... the ChineseTokenizer works, but the
CJKTokenizer will not.
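
If you want to see the difference for yourself, you can run the same
text through both tokenizers and print what they emit. Here is a
minimal sketch; it assumes the Lucene 2.x contrib analyzers
(lucene-analyzers jar) and the old TokenStream.next()/Token.termText()
API, so adjust it to whatever Lucene release you are on. The sample
text 一二三四 is just four CJK characters standing in for C1C2C3C4.

import java.io.StringReader;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.cn.ChineseTokenizer;

// Minimal sketch: print the tokens each tokenizer produces for the
// same text. Assumes the Lucene 2.x contrib analyzers and the
// pre-2.9 TokenStream API (next() returning Token).
public class TokenizerCompare {

    public static void main(String[] args) throws Exception {
        // Four CJK characters standing in for C1, C2, C3, C4.
        String text = "一二三四";

        System.out.println("ChineseTokenizer (one character per token):");
        printTokens(new ChineseTokenizer(new StringReader(text)));

        System.out.println("CJKTokenizer (overlapping bi-grams):");
        printTokens(new CJKTokenizer(new StringReader(text)));
    }

    private static void printTokens(TokenStream stream) throws Exception {
        Token token;
        while ((token = stream.next()) != null) {
            System.out.println("  " + token.termText());
        }
        stream.close();
    }
}

Expected output is 一, 二, 三, 四 for the ChineseTokenizer and 一二,
二三, 三四 for the CJKTokenizer, matching the description above.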

-----Original Message-----
From: Teruhiko Kurosaka [mailto:[EMAIL PROTECTED] 
Sent: Friday, June 22, 2007 2:25 PM
To: solr-user@lucene.apache.org
Subject: RE: Multi-language Tokenizers / Filters recommended? 

Hi Daniel,
As you know, Chinese and Japanese do not use
spaces or any other delimiters to break words.
To overcome this problem, CJKTokenizer uses a method
called bi-gram, where a run of ideographic (=Chinese)
characters is turned into tokens of two neighboring
characters.  So a run of five characters ABCDE
will result in four tokens AB, BC, CD, and DE.

So a search for "BC" will hit this text,
even if AB is a word and CD is another word.
That is, it increases the noise in the hits.
I don't know how much of a problem this is
for Chinese.  But for Japanese, my native language,
it is a problem. Because of this, a search
for Kyoto will return false hits on documents
that include Tokyoto, i.e. Tokyo prefecture.
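
To make the Kyoto example concrete: Kyoto is 京都 and Tokyo
prefecture is 東京都. Bi-gramming 東京都 produces the tokens 東京 and
京都, so the indexed document contains exactly the token that a query
for 京都 produces. The short sketch below uses a hand-rolled
bigrams() helper (purely illustrative, not a Lucene API) to show the
overlap.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: a hand-rolled bi-gram splitter (not a Lucene
// class) used to show why a bi-gram index causes the
// Kyoto / Tokyo-prefecture false hit.
public class BigramFalseHit {

    // Split a string into overlapping two-character tokens:
    // ABCDE -> AB, BC, CD, DE. (BMP characters only, for simplicity.)
    static List<String> bigrams(String text) {
        List<String> tokens = new ArrayList<String>();
        for (int i = 0; i + 1 < text.length(); i++) {
            tokens.add(text.substring(i, i + 2));
        }
        return tokens;
    }

    public static void main(String[] args) {
        String document = "東京都"; // "Tokyo prefecture"
        String query = "京都";      // "Kyoto"

        List<String> docTokens = bigrams(document);   // [東京, 京都]
        List<String> queryTokens = bigrams(query);    // [京都]

        // The query token 京都 appears among the document's bi-grams,
        // so the document matches even though it never mentions Kyoto.
        System.out.println("document tokens: " + docTokens);
        System.out.println("query tokens:    " + queryTokens);
        System.out.println("false hit? " + docTokens.containsAll(queryTokens));
    }
}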

There is another method called morphological
analysis, which uses dictionaries and grammar
rules to break text down into real words.  You
might want to consider this method.

-kuro  

