Hi all,

After some testing, I got it working :)

Reduced schema.xml: http://kwon37xi.springnote.com/pages/335478

Basically you only need to apply the change to schema.xml; the class is already in the 1.3 nightly build.

CHANGE: replace the tokenizer element defined in every analyzer element, especially <analyzer type="index"> and <analyzer type="query"> for the fieldtype "text", as follows in schema.xml:

<tokenizer class="solr.CJKTokenizerFactory"/>

A working tokenizer element should look like the above.
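For context, here is a minimal sketch of what the "text" fieldtype could look like after the change. The surrounding filter chain is an illustration, not taken from the reduced schema linked above; the important part is that both analyzers use the same tokenizer:

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- CJKTokenizer emits overlapping bigrams for CJK characters -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <!-- must match the index-time tokenizer, or queries will return nothing -->
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```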
[The magic]
A very simple wrapper class wraps the Lucene CJKTokenizer: http://people.apache.org/~gsingers/solr-clover/reports/org/apache/solr/analysis/CapitalizationFilterFactory.html
It is a similar plugin to the one in Nutch, but this time you make sure the tokenization is the same at both index and query time.

[Attention you need to pay for documents]
For CJK users (or Windows users), you need to pay special attention whenever you prepare a document for Solr, especially if you PROCESS the document on Windows (2000/XP/Vista and later):

1. The document CANNOT HAVE the "UTF-8 signature": please use an advanced editor such as EmEditor (change it when saving), Notepad++ (change it under the Format menu), or gvim (the default)... etc. You can use anything but NOT Notepad!
2. The encoding of the document must be UTF-8, not the system default. UTF-8 without the signature is very easy to misidentify as ANSI, and then your index will be poisoned.

DRAWBACK (or should it be called a symptom?): your index will look fine in Luke (I don't know what kind of magic Lucene played, but this is what I experienced an hour ago), but searches will return nothing even though the query is parsed properly (via the FULL interface with the debug option ticked).

Diagnosis:
i. Prepare a document containing both English and UTF-8 characters. The characters should be ones your machine supports in ANSI mode, i.e. UTF-8 characters that could come from your machine's encoding: if that is EUC-JP or Shift-JIS, the document should contain Japanese characters; if it is Big5 or GB2312, it should contain Chinese characters.
ii. Start Solr with Jetty (mySolr) and POST this document to the index.
iii. Search for an English word in this document. If the returned document contains broken characters (a series of question marks in black diamonds in Firefox) -- this is what multibyte-encoded characters look like when they are misinterpreted as a UTF-8 stream -- then you know you have been trapped by the Windows UTF-8 processing mechanism.

*Can somebody kindly post this to the wiki in an appropriate format?
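The "UTF-8 signature" in point 1 is the three-byte BOM (EF BB BF) that Notepad prepends when saving UTF-8. If you cannot control the editor, you can strip it programmatically before POSTing the document to Solr. A minimal sketch in Python (the helper name is my own, not part of Solr):

```python
# Strip the UTF-8 "signature" (BOM, bytes EF BB BF) from document bytes
# before sending them to Solr, so Windows editors cannot poison the index.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    """Return the document bytes without a leading UTF-8 BOM."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

# Example: a document saved by Notepad starts with the signature.
doc = UTF8_BOM + "hello \u4e16\u754c".encode("utf-8")
clean = strip_utf8_bom(doc)
assert not clean.startswith(UTF8_BOM)
assert clean.decode("utf-8") == "hello \u4e16\u754c"
```

Running this check on every file before indexing makes the "search returns nothing" symptom above much easier to rule out.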
I am not so familiar with the wiki syntax and don't have much time for formatting...

Thank you,
Vinci

Vinci wrote:
>
> Hi,
>
> I would like to ask: is any support for CJKTokenizer
> (org.apache.lucene.analysis.cjk.CJKTokenizer) available for Solr 1.3 now?
> If it is supported, which nightly build can I try and how can I turn it
> on? (I have nightly builds up to 2008 Mar 8 on hand)
> If it is not supported, how can I use a plugin to turn on this feature in
> the 1.3 nightly build?
>
> Thank you,
> Vinci
>

--
View this message in context: http://www.nabble.com/CJKTokenizer-in-Solr-1.3--tp16260321p16287242.html
Sent from the Solr - User mailing list archive at Nabble.com.