Thanks, Toru and Chris,
I tried both the CJKTokenizer and CJKAnalyzer. Both return some unexpected 
highlight results when I tested with German text. The field value I searched is 
"Ein Mann beißt den Hund". The search term is beißt. 

When using CJKAnalyzer, beißt is treated as two separate terms (bei and ß), and 
the highlight result is: 
<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str> 

When using CJKTokenizer, beißt is treated as three separate terms, and the result is:
<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as a single word, and the result is:
<str>Ein Mann <em>beißt</em> den Hund</str>
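
My guess at what is happening (not checked against the Lucene source): the 
CJK tokenizer may start a new token whenever the text switches between Basic 
Latin characters and anything else, and ß (U+00DF) is outside Basic Latin. 
CJKAnalyzer may then drop "t" as a stop word. The Java sketch below only 
mimics that guess. ScriptRunSketch, splitWord, dropStopWords and 
ASSUMED_STOP_WORDS are names made up for illustration and are not Lucene APIs, 
and the sketch does not attempt the bigram handling of real CJK text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustration only: mimics the split points seen in the highlight output.
// This is NOT the Lucene implementation.
final class ScriptRunSketch {

    // Assumption: the analyzer's stop list contains single letters such as "t".
    static final List<String> ASSUMED_STOP_WORDS = Arrays.asList("s", "t");

    // True for characters in the Basic Latin block (U+0000 to U+007F).
    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    // Splits a word into runs, starting a new run whenever the text switches
    // between Basic Latin and non-Basic-Latin characters.
    static List<String> splitWord(String word) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsLatin = false;
        for (char c : word.toCharArray()) {
            boolean latin = isBasicLatin(c);
            if (current.length() > 0 && latin != currentIsLatin) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.append(c);
            currentIsLatin = latin;
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    // Removes runs that appear in the assumed stop list.
    static List<String> dropStopWords(List<String> runs) {
        List<String> kept = new ArrayList<>();
        for (String run : runs) {
            if (!ASSUMED_STOP_WORDS.contains(run)) {
                kept.add(run);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // "bei\u00DFt" is "beißt"; the escape avoids source-encoding problems.
        List<String> runs = splitWord("bei\u00DFt");
        System.out.println(runs);                 // three runs: bei, ß, t
        System.out.println(dropStopWords(runs));  // two runs: bei, ß
    }
}
```

If the guess is right, the three runs would correspond to the CJKTokenizer 
result and the two remaining runs to the CJKAnalyzer result.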


I understand why the standard tokenizer treats beißt as a word, but I don't know 
how CJKTokenizer and CJKAnalyzer work. Could anyone explain a little bit?


Thanks
Xuesong

-----Original Message-----
From: Toru Matsuzawa [mailto:[EMAIL PROTECTED] 
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment did not go through, 
so I am sending it again. 

> > I got the error below after adding CJKTokenizer to schema.xml. I
> > checked the constructor of CJKTokenizer; it requires a Reader parameter,
> > which I guess is why I get this error. I searched the email archive, and it
> > seems to work for other users. Does anyone know what the problem is?
> 
> 
> The CJKTokenizerFactory that I am using is attached.
> 
--
package org.apache.solr.analysis.ja;

import java.io.Reader;
import org.apache.lucene.analysis.cjk.CJKTokenizer;

import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * Creates a CJKTokenizer that reads its text from the given Reader.
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer(input);
  }

}


-- 
Toru Matsuzawa


