Christian,
This bit of trivia is probably useful to you as well. Lucene's
StandardTokenizer uses these Unicode ranges for CJK characters:
KOREAN = [\uac00-\ud7af\u1100-\u11ff]
// Chinese, Japanese
CJ = [\u3040-\u318f\u3100-\u312f\u3040-\u309F\u30A0-\u30FF
      \u31F0-\u31FF\u3300-\u337f\u3400-\u4dbf\u4e00-\u9fff\uf900-\ufaff
      \uff65-\uff9f]
I haven't done my homework to correlate that with CJK Extension B,
but I bet you know! :)
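For what it's worth, all of those ranges sit in the Basic Multilingual
Plane, while Extension B starts at U+20000, above \uffff. A quick sketch
(mine, not Lucene code; the class name is made up) of why a char-based
scanner never sees an Extension B ideograph as one character:

public class ExtBProbe {
    public static void main(String[] args) {
        int extB = 0x20000; // first code point of CJK Extension B
        // In Java, anything above U+FFFF is stored as a surrogate pair.
        char[] pair = Character.toChars(extB);
        System.out.println(pair.length); // 2
        System.out.printf("U+%04X U+%04X%n", (int) pair[0], (int) pair[1]);
        // -> U+D840 U+DC00, neither of which falls in the CJ/KOREAN ranges
        System.out.println(Character.UnicodeBlock.of(extB));
        // -> CJK_UNIFIED_IDEOGRAPHS_EXTENSION_B (Java 5+)
    }
}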
Erik
On Feb 27, 2008, at 10:23 PM, Christian Wittern wrote:
Hi there,
The documents I am trying to index with Solr contain characters from CJK
Extension B, which was added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems that Solr (and maybe Lucene) does not yet support
these characters.
Solr seems to accept the documents without problem, but when I retrieve
them, there are strange placeholders such as #0; in their place. Might
this be a configuration issue?
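My hunch (a guess only; this is not Solr's actual code) is that something
escapes the response char by char, so each half of the surrogate pair gets
replaced on its own, roughly like this:

public class EscapeGuess {
    // Naive per-char escaping corrupts supplementary characters, because
    // each half of a surrogate pair is invalid XML on its own.
    static String escapePerChar(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (Character.isHighSurrogate(c) || Character.isLowSurrogate(c)) {
                out.append("#0;"); // placeholder instead of the real character
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
    public static void main(String[] args) {
        String extB = new String(Character.toChars(0x20000)); // CJK Ext B
        System.out.println(escapePerChar(extB)); // "#0;#0;" -- ideograph lost
    }
}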
While most of the characters in this range are very rare, the latest
mapping tables between Unicode and the Japanese JIS coded character sets
mean that some characters in everyday use in Japan are now encoded in this
area. It therefore seems highly desirable that this problem be solved. I
am testing this on a Mac OS X 10.5.2 system with Java 1.5.0_13 and Solr
1.2.0.
Any hints appreciated,
Christian Wittern
--
Christian Wittern
Institute for Research in Humanities, Kyoto University
47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN