To elaborate... StandardTokenizer comes into play for indexing and querying (and only if you have it configured for that field in schema.xml). But the original issue seems to be with parsing the content and storing it in the Lucene index, which is separate from the tokenization process altogether; I just wanted to point it out as something else you might encounter along the way.
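One way to rule analysis out entirely is to round-trip a supplementary character through a stored-but-not-indexed field: if it comes back intact, no tokenizer was ever involved and the mangling happened upstream. A rough sketch against the Lucene API of that vintage (the field name, analyzer choice, and class name are just illustrative):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class StoredFieldRoundTrip {
    public static void main(String[] args) throws Exception {
        // U+20000, the first CJK Extension B ideograph, as a Java String
        // (two chars, i.e. a surrogate pair).
        String extB = new String(Character.toChars(0x20000));

        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        // Stored but not indexed, so no tokenizer ever touches the value.
        doc.add(new Field("content", extB, Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        String back = searcher.doc(0).get("content");
        // true means storage preserved the character; look elsewhere.
        System.out.println(extB.equals(back));
        searcher.close();
    }
}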
Erik
On Feb 28, 2008, at 11:26 AM, Ken Krugler wrote:
Hi Christian,
The documents I am trying to index with Solr contain characters from CJK Extension B, which was added to Unicode in version 3.1 (March 2001). Unfortunately, it seems that Solr (and maybe Lucene) does not yet support these characters.

Solr seems to accept the documents without problem, but when I retrieve them, there are strange placeholders like #0; etc. in their place. Might this be a configuration issue?
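To see exactly what is coming back, it can help to dump the code points of the retrieved field value: intact Extension B text shows up as single code points at U+20000 and above, while mangled text shows replacement characters or leftover entity fragments. A minimal sketch (the sample value is just a stand-in for whatever Solr actually returns):

public class CodePointDump {
    public static void main(String[] args) {
        // Substitute the field value fetched back from Solr; this sample
        // is U+20000 so the snippet runs on its own.
        String value = new String(Character.toChars(0x20000));

        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}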
And as Erik mentioned, it appears that line 114 of StandardTokenizerImpl.jflex:

http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

needs to be updated to include the Extension B character range.
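For reference, the Extension B block spans U+20000 through U+2A6DF, entirely above the Basic Multilingual Plane, so in Java these characters arrive as surrogate pairs and a 16-bit \uXXXX range in the grammar cannot match them directly. A quick illustration (the range constants come from the Unicode block definition; the rest is just a sketch):

public class ExtensionBCheck {
    // CJK Unified Ideographs Extension B block, per Unicode 3.1.
    static final int EXT_B_FIRST = 0x20000;
    static final int EXT_B_LAST = 0x2A6DF;

    static boolean isExtensionB(int codePoint) {
        return codePoint >= EXT_B_FIRST && codePoint <= EXT_B_LAST;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                      // 2 chars: a surrogate pair
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(isExtensionB(s.codePointAt(0)));  // true
    }
}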
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"