To elaborate: StandardTokenizer comes into play for indexing and querying, and only if it is configured for that field in schema.xml. But the original issue seems to be with actually parsing the content properly and storing it in the Lucene index, which is separate from the tokenization process altogether - I just wanted to point it out as something else you might encounter along the way.
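
To see that split concretely, here's a minimal sketch against the plain Lucene API (class and field names here are only illustrative): the stored value comes back verbatim no matter what the analyzer does to the tokens.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.store.RAMDirectory;

    public class StoredFieldDemo {
        public static void main(String[] args) throws Exception {
            RAMDirectory dir = new RAMDirectory();
            IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(), true);

            // Store.YES keeps the raw text verbatim; Index.TOKENIZED runs it
            // through the analyzer for the inverted index. The two paths are
            // independent, so a tokenizer that drops characters does not
            // change what comes back from the stored field.
            Document doc = new Document();
            doc.add(new Field("content", "\uD840\uDC00 test", // U+20000 as a surrogate pair
                              Field.Store.YES, Field.Index.TOKENIZED));
            writer.addDocument(doc);
            writer.close();

            IndexReader reader = IndexReader.open(dir);
            System.out.println(reader.document(0).get("content")); // the verbatim stored text
            reader.close();
        }
    }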

        Erik



On Feb 28, 2008, at 11:26 AM, Ken Krugler wrote:

Hi Christian,

> The documents I am trying to index with Solr contain characters from CJK Extension B, which were added to Unicode in version 3.1 (March 2001). Unfortunately, it seems that Solr (and maybe Lucene) does not yet support these characters.
>
> Solr seems to accept the documents without problem, but when I retrieve them, there are strange placeholders like #0; etc. in their place. Might this be a configuration issue?
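
For context, Extension B code points sit above U+FFFF, so in Java's UTF-16 strings each one occupies a surrogate pair rather than a single char - which is why char-at-a-time code can mangle them. A quick sketch (plain Java, nothing Solr-specific):

    public class SupplementaryDemo {
        public static void main(String[] args) {
            // U+20000 is the first code point in CJK Extension B.
            String s = new String(Character.toChars(0x20000));

            System.out.println(s.length());                       // 2: one code point, two chars
            System.out.println(s.codePointCount(0, s.length()));  // 1

            // Walking the string char-by-char sees the two surrogate
            // halves separately; neither is a valid character on its own.
            for (int i = 0; i < s.length(); i++) {
                System.out.printf("char[%d] = U+%04X%n", i, (int) s.charAt(i));
            }
        }
    }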

And as Erik mentioned, it appears that line 114 of StandardTokenizerImpl.jflex:

http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

needs to be updated to include the Extension B character range.
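
For reference, Extension B spans U+20000 through U+2A6DF, and since those code points are above U+FFFF the tokenizer sees them as surrogate pairs in its UTF-16 input. The range test itself is simple in terms of code points (a hypothetical helper for illustration, not anything in Lucene):

    // Hypothetical helper, not part of Lucene: tests whether a code point
    // falls in CJK Extension B (U+20000..U+2A6DF, added in Unicode 3.1).
    public final class CjkExtB {
        private CjkExtB() {}

        public static boolean isExtensionB(int codePoint) {
            return codePoint >= 0x20000 && codePoint <= 0x2A6DF;
        }

        public static void main(String[] args) {
            String s = new String(Character.toChars(0x20000));
            System.out.println(isExtensionB(s.codePointAt(0))); // true
        }
    }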

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
