Hi Christian,

> The documents I am trying to index with Solr contain characters from CJK
> Extension B, which was added to Unicode in version 3.1 (March 2001).
> Unfortunately, it seems that Solr (and maybe Lucene) does not yet support
> these characters.
>
> Solr seems to accept the documents without problem, but when I retrieve
> them, there are strange placeholders like #0; in their place. Might this
> be a configuration issue?

1. What encoding are you using when pushing these documents to Solr? Both as specified in the XML declaration and in the POST request. There's also a separate issue with the MIME type you use for the POST, if you're doing it yourself rather than using the latest scripts from Solr. (A minimal POST sketch follows this list.)

2. What do these characters look like in the XML you're pushing? For example, if they are encoded as two surrogate characters instead of one code point from the Extension B set, most XML parsers will not handle them correctly. This is the most common source of similar issues I've seen. (See the round-trip sketch after this list.)

3. Do the base plane (BMP) characters (code points below U+10000) round-trip correctly? (The same sketch below compares a BMP character with an Extension B one.)
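
To make point 1 concrete, here's a minimal Java 5-era sketch of pushing a UTF-8 document to Solr's update handler. The URL, class name, field names, and document are just placeholders, not your actual setup:

  import java.io.OutputStream;
  import java.net.HttpURLConnection;
  import java.net.URL;

  public class PostUtf8ToSolr {
      public static void main(String[] args) throws Exception {
          String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
              + "<add><doc>"
              + "<field name=\"id\">extb-1</field>"
              // U+20103 is a surrogate pair in Java source, but goes out
              // on the wire as a single UTF-8-encoded code point.
              + "<field name=\"text\">\uD840\uDD03</field>"
              + "</doc></add>";

          URL url = new URL("http://localhost:8983/solr/update"); // assumed
          HttpURLConnection conn = (HttpURLConnection) url.openConnection();
          conn.setDoOutput(true);
          conn.setRequestMethod("POST");
          // The charset here must match both the XML declaration above
          // and the encoding actually used for the bytes being sent.
          conn.setRequestProperty("Content-Type", "text/xml; charset=UTF-8");
          OutputStream out = conn.getOutputStream();
          out.write(xml.getBytes("UTF-8"));
          out.close();
          System.out.println("Solr responded: " + conn.getResponseCode());
      }
  }

If the Content-Type charset disagrees with the bytes you actually send, the servlet container can silently reinterpret them before Solr ever sees the document.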
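
And for points 2 and 3, a quick sketch that checks whether both a BMP character and a CJK Extension B character survive a UTF-8 round trip, and shows the single numeric character reference you'd want in the XML:

  public class ExtBRoundTrip {
      public static void main(String[] args) throws Exception {
          String bmp = "\u65E5";                                // U+65E5, BMP
          String extB = new String(Character.toChars(0x20103)); // CJK Ext B
          // Extension B characters are one code point but two Java chars.
          System.out.println("chars=" + extB.length()
              + " codePoints=" + extB.codePointCount(0, extB.length()));

          String both = bmp + extB;
          String back = new String(both.getBytes("UTF-8"), "UTF-8");
          System.out.println("round-trip ok: " + both.equals(back));

          // In XML, use one reference for the whole code point:
          System.out.println("&#x"
              + Integer.toHexString(extB.codePointAt(0)) + ";");
          // Two references for the surrogates (&#xD840;&#xDD03;) are not
          // well-formed XML; parsers will reject or mangle them.
      }
  }

If the BMP character round-trips but the Extension B one doesn't, that points at supplementary-character handling rather than a general encoding problem.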

One potential issue is the XML parser being used - most have been updated to handle supplementary Unicode code points, but there are a few older parsers that still fail on a reference like &#x20103;, for example.
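
A small way to probe that (a sketch; your parser and pipeline are the unknowns being tested, and the class name is made up):

  import java.io.ByteArrayInputStream;
  import javax.xml.parsers.DocumentBuilderFactory;
  import org.w3c.dom.Document;

  public class ParserCheck {
      public static void main(String[] args) throws Exception {
          String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
              + "<field>&#x20103;</field>";
          Document doc = DocumentBuilderFactory.newInstance()
              .newDocumentBuilder()
              .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")));
          String text = doc.getDocumentElement().getTextContent();
          // A healthy parser yields the single code point U+20103; a broken
          // one may drop it or leave placeholder junk like #0; behind.
          System.out.println("U+"
              + Integer.toHexString(text.codePointAt(0)).toUpperCase());
      }
  }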

-- Ken


> While most of the characters in this range are very rare, with the latest
> mapping tables between Unicode and the Japanese JIS coded character sets,
> some characters in everyday use in Japan are now encoded in this area. It
> therefore seems highly desirable that this problem gets solved. I am
> testing this on a Mac OS X 10.5.2 system with Java 1.5.0_13 and Solr 1.2.0.

> Any hints appreciated,
>
> Christian Wittern


--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"