Christian,
This bit of trivia is probably useful to you as well. Lucene's
StandardTokenizer uses these Unicode ranges for CJK characters:
KOREAN = [\uac00-\ud7af\u1100-\u11ff]
// Chinese, Japanese
CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF
      \u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff
      \uff65-\uff9f]
I haven't done my homework to correlate that with CJK Extension B,
but I bet you know! :)
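For what it's worth, all of those ranges sit in the Basic Multilingual
Plane, while Extension B starts at U+20000, above \uffff. A quick sketch
(mine, not Lucene code; the class name is made up) of why a char-based
scanner never sees an Extension B ideograph as one character:

public class ExtBProbe {
    public static void main(String[] args) {
        int extB = 0x20000; // first code point of CJK Extension B
        // In Java, anything above U+FFFF is stored as a surrogate pair.
        char[] pair = Character.toChars(extB);
        System.out.println(pair.length); // 2
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // -> U+D840 U+DC00, neither of which falls in the CJ/KOREAN ranges
        System.out.println(Character.UnicodeBlock.of(extB));
        // -> CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B (Java 5+)
    }
}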
Erik
On Feb 27, 2008, at 10:23 PM, Christian Wittern wrote:
Hi there,
The documents I am trying to index with Solr contain characters from CJK
Extension B, which was added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems that Solr (and maybe Lucene) does not yet support
these characters.
Solr seems to accept the documents without problem, but when I retrieve
them, there are strange placeholders such as #0; in their place. Might
this be a configuration issue?
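My hunch (a guess only; this is not Solr's actual code) is that something
escapes the response char by char, so each half of the surrogate pair gets
replaced on its own, roughly like this:

public class EscapeGuess {
    // Naive per-char escaping corrupts supplementary characters, because
    // each half of a surrogate pair is invalid XML on its own.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                out.append("#0;"); // placeholder instead of the real character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
    public static void main(String[] args) {
        String extB = new String(Character.toChars(0x20000)); // CJK Ext B
        System.out.println(escapePerChar(extB)); // "#0;#0;" -- ideograph lost
    }
}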
While most of the characters in this range are very rare, the latest
mapping tables between Unicode and the Japanese JIS coded character sets
mean that some characters in everyday use in Japan are now encoded in this
area. It therefore seems highly desirable that this problem be solved. I
am testing this on a Mac OS X 10.5.2 system with Java 1.5.0_13 and Solr
1.2.0.
Any hints appreciated,
Christian Wittern
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN