Thanks, Toru and Chris. I tried both the CJKTokenizer and the CJKAnalyzer. Both return some unexpected highlight results when I test with German text. The field value I searched is "Ein Mann beißt den Hund". The search term is beißt.

When using CJKAnalyzer, beißt is treated as 2 separate terms (bei and ß), and the highlight result is:

<str>Ein Mann <em>bei</em><em>ß</em>t den Hund</str>

When using CJKTokenizer, beißt is treated as 3 separate terms, and the result is:

<str>Ein Mann <em>bei</em><em>ß</em><em>t</em> den Hund</str>

When using the standard tokenizer, beißt is treated as one word, and the result is:

<str>Ein Mann <em>beißt</em> den Hund</str>

I understand why the standard tokenizer treats beißt as a word, but I don't know how CJKTokenizer and CJKAnalyzer work. Could anyone explain a little bit?

Thanks
Xuesong

-----Original Message-----
From: Toru Matsuzawa [mailto:[EMAIL PROTECTED]
Sent: Monday, June 18, 2007 10:29 PM
To: solr-user@lucene.apache.org
Subject: Re: add CJKTokenizer to solr

I'm sorry. The attachment could not be appended, so I am sending it again.

> > I got the error below after adding CJKTokenizer to schema.xml. I
> > checked the constructor of CJKTokenizer, it requires a Reader parameter,
> > I guess that's why I get this error, I searched the email archive, it
> > seems working for other users. Does anyone know what is the problem?
> >
> CJKTokenizerFactory that I am using is appended.
>

--
package org.apache.solr.analysis.ja;

import java.io.Reader;

import org.apache.lucene.analysis.cjk.CJKTokenizer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.solr.analysis.BaseTokenizerFactory;

/**
 * CJKTokenizer for Solr
 * @see org.apache.lucene.analysis.cjk.CJKTokenizer
 * @author matsu
 *
 */
public class CJKTokenizerFactory extends BaseTokenizerFactory {

  /**
   * @see org.apache.solr.analysis.TokenizerFactory#create(Reader)
   */
  public TokenStream create(Reader input) {
    return new CJKTokenizer( input );
  }

}
--
Toru Matsuzawa
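
One possible reading of the beißt results reported in this message, offered as an unverified assumption and not as a description of the Lucene source: the tokenizer may group Basic Latin letters into ordinary word tokens, treat any other letter (such as ß) as if it were a CJK character, and the analyzer may then drop short stop words such as "t". The sketch below is a self-contained model of that idea only. The class name CjkSplitSketch, the methods tokenize and analyze, and the stop list are all hypothetical and do not come from Lucene or Solr.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class CjkSplitSketch {

    // Hypothetical stop list: assumes the analyzer removes single-letter terms like "t".
    static final Set<String> ASSUMED_STOP_WORDS =
            new HashSet<>(Arrays.asList("s", "t", "a", "the"));

    static boolean isBasicLatin(char c) {
        return Character.UnicodeBlock.of(c) == Character.UnicodeBlock.BASIC_LATIN;
    }

    static boolean isAsciiWordChar(char c) {
        return isBasicLatin(c) && Character.isLetterOrDigit(c);
    }

    static boolean isOtherLetter(char c) {
        return !isBasicLatin(c) && Character.isLetter(c);
    }

    // Assumed model of the tokenizer: Basic Latin runs become one token each,
    // runs of other letters become overlapping pairs (or a single character
    // when the run has length one).
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            int start = i;
            if (isAsciiWordChar(c)) {
                while (i < text.length() && isAsciiWordChar(text.charAt(i))) {
                    i++;
                }
                tokens.add(text.substring(start, i).toLowerCase(Locale.ROOT));
            } else if (isOtherLetter(c)) {
                while (i < text.length() && isOtherLetter(text.charAt(i))) {
                    i++;
                }
                if (i - start == 1) {
                    tokens.add(text.substring(start, i));
                } else {
                    for (int j = start; j + 1 < i; j++) {
                        tokens.add(text.substring(j, j + 2));
                    }
                }
            } else {
                i++; // whitespace and punctuation only separate tokens
            }
        }
        return tokens;
    }

    // Assumed model of the analyzer: the tokenizer above plus stop-word removal.
    static List<String> analyze(String text) {
        List<String> kept = new ArrayList<>();
        for (String token : tokenize(text)) {
            if (!ASSUMED_STOP_WORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String word = "bei\u00DFt"; // beißt, escaped to avoid source-encoding issues
        System.out.println(tokenize(word));
        System.out.println(analyze(word));
    }
}
```

Under these assumptions, tokenize returns the three terms bei, ß and t for beißt, and analyze returns only bei and ß, which matches the three highlighted fragments seen with CJKTokenizer and the two seen with CJKAnalyzer. Whether the real classes work this way would need to be confirmed against the Lucene source.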