Hi Christian,

The documents I am trying to index with Solr contain characters from CJK
Extension B, which was added to Unicode in version 3.1 (March 2001).
Unfortunately, it seems that Solr (and maybe Lucene) does not yet support
these characters.

Solr seems to accept the documents without any problem, but when I retrieve
them, strange placeholders such as #0; appear in place of these characters.
Might this be a configuration issue?

And as Erik mentioned, it appears that line 114 of StandardTokenizerImpl.jflex:

http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

needs to be updated to include the Extension B character range.
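
For what it's worth, here is a minimal sketch of why this range is awkward to
add. It is my own illustration, not code from Lucene or Solr, and the class and
method names (ExtBCheck, isCjkExtensionB, utf16Length) are hypothetical. The
Extension B block covers U+20000 to U+2A6DF, which lies outside the Basic
Multilingual Plane, so each character is stored as a surrogate pair (two Java
chars) rather than one. I am assuming that a scanner which matches one 16-bit
char at a time would see the two surrogates separately, so a fix may need more
than a new range in the grammar.

```java
// Illustrative sketch only; ExtBCheck and its methods are hypothetical names,
// not part of Lucene or Solr.
final class ExtBCheck {
    // CJK Unified Ideographs Extension B block: U+20000..U+2A6DF.
    static final int EXT_B_START = 0x20000;
    static final int EXT_B_END = 0x2A6DF;

    // True if the code point falls inside the Extension B block.
    static boolean isCjkExtensionB(int codePoint) {
        return codePoint >= EXT_B_START && codePoint <= EXT_B_END;
    }

    // Number of UTF-16 code units (Java chars) needed for the code point.
    static int utf16Length(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int cp = 0x20000; // first code point of Extension B
        char[] units = Character.toChars(cp);
        System.out.printf("U+%X in Extension B: %b, UTF-16 units: %d%n",
                cp, isCjkExtensionB(cp), units.length);
        // Print each UTF-16 code unit that a char-based scanner would see.
        for (char c : units) {
            System.out.printf("  0x%04X%n", (int) c);
        }
    }
}
```

Running main prints the individual code units for the first Extension B
character, which is what a char-based tokenizer would be handed.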

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
