Hi all,

After some testing, I got it working :)

Reduced schema.xml: http://kwon37xi.springnote.com/pages/335478

Basically you only need to apply the change to schema.xml; the class is already in the 1.3 nightly build.

CHANGE: replace the tokenizer element defined in every analyzer element, especially <analyzer type="index"> and <analyzer type="query"> for the fieldtype "text", as follows in schema.xml:

<tokenizer class="solr.CJKTokenizerFactory"/>

A working tokenizer element should look like the above.
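For context, here is a minimal sketch of what the "text" fieldtype could look like after the change. The surrounding filter chain is an illustration, not taken from the reduced schema linked above; the important part is that both analyzers use the same tokenizer:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- CJKTokenizer emits overlapping bigrams for CJK characters -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- must match the index-time tokenizer, or queries will return nothing -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```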
[The magic]
A very simple wrapper class wraps the Lucene CJKTokenizer: http://people.apache.org/~gsingers/solr-clover/reports/org/apache/solr/analysis/CapitalizationFilterFactory.html
It is a similar plugin to the one in Nutch, but this time you make sure the tokenization is the same at both index and query time.

[Attention you need to pay for documents]
For CJK users (or Windows users), you need to pay special attention whenever you prepare a document for Solr, especially if you PROCESS the document on Windows (2000/XP/Vista and later):

1. The document CANNOT HAVE the "UTF-8 signature": please use an advanced editor such as EmEditor (change it when saving), Notepad++ (change it under the Format menu), or gvim (the default)... etc. You can use anything but NOT Notepad!
2. The encoding of the document must be UTF-8, not the system default. UTF-8 without the signature is very easy to misidentify as ANSI, and then your index will be poisoned.

DRAWBACK (or should it be called a symptom?): your index will look fine in Luke (I don't know what kind of magic Lucene played, but this is what I experienced an hour ago), but searches will return nothing even though the query is parsed properly (via the FULL interface with the debug option ticked).

Diagnosis:
i. Prepare a document containing both English and UTF-8 characters. The characters should be ones your machine supports in ANSI mode, i.e. UTF-8 characters that could come from your machine's encoding: if that is EUC-JP or Shift-JIS, the document should contain Japanese characters; if it is Big5 or GB2312, it should contain Chinese characters.
ii. Start Solr with Jetty (mySolr) and POST this document to the index.
iii. Search for an English word in this document. If the returned document contains broken characters (a series of question marks in black diamonds in Firefox) -- this is what multibyte-encoded characters look like when they are misinterpreted as a UTF-8 stream -- then you know you have been trapped by the Windows UTF-8 processing mechanism.

*Can somebody kindly post this to the wiki in an appropriate format?
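The "UTF-8 signature" in point 1 is the three-byte BOM (EF BB BF) that Notepad prepends when saving UTF-8. If you cannot control the editor, you can strip it programmatically before POSTing the document to Solr. A minimal sketch in Python (the helper name is my own, not part of Solr):

```python
# Strip the UTF-8 "signature" (BOM, bytes EF BB BF) from document bytes
# before sending them to Solr, so Windows editors cannot poison the index.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the document bytes without a leading UTF-8 BOM."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

# Example: a document saved by Notepad starts with the signature.
doc = UTF8_BOM + "hello \u4e16\u754c".encode("utf-8")
clean = strip_utf8_bom(doc)
assert not clean.startswith(UTF8_BOM)
assert clean.decode("utf-8") == "hello \u4e16\u754c"
```

Running this check on every file before indexing makes the "search returns nothing" symptom above much easier to rule out.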
I am not so familiar with the wiki syntax and don't have much time for formatting...

Thank you,
Vinci

Vinci wrote:
>
> Hi,
>
> I would like to ask: is any support for CJKTokenizer
> (org.apache.lucene.analysis.cjk.CJKTokenizer) available for Solr 1.3 now?
> If it is supported, which nightly build can I try and how can I turn it
> on? (I have nightly builds up to 2008 Mar 8 on hand)
> If it is not supported, how can I use a plugin to turn on this feature in
> the 1.3 nightly build?
>
> Thank you,
> Vinci
>

--
View this message in context: http://www.nabble.com/CJKTokenizer-in-Solr-1.3--tp16260321p16287242.html
Sent from the Solr - User mailing list archive at Nabble.com.