To elaborate... StandardTokenizer comes into play for indexing and querying (and only if you have it configured for that field in schema.xml). But the original issue seems to be with parsing the content and storing it in the Lucene index, which is separate from the tokenization process altogether; I just wanted to point it out as something else you might encounter along the way.
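One way to rule analysis out entirely is to round-trip a supplementary character through a stored-but-not-indexed field: if it comes back intact, no tokenizer was ever involved and the mangling happened upstream. A rough sketch against the Lucene API of that vintage (the field name, analyzer choice, and class name are just illustrative):

import org.apache.lucene.analysis.WhitespaceAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.RAMDirectory;

public class StoredFieldRoundTrip {
    public static void main(String[] args) throws Exception {
        // U+20000, the first CJK Extension B ideograph, as a Java String
        // (two chars, i.e. a surrogate pair).
        String extB = new String(Character.toChars(0x20000));

        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer = new IndexWriter(dir, new WhitespaceAnalyzer(), true);
        Document doc = new Document();
        // Stored but not indexed, so no tokenizer ever touches the value.
        doc.add(new Field("content", extB, Field.Store.YES, Field.Index.NO));
        writer.addDocument(doc);
        writer.close();

        IndexSearcher searcher = new IndexSearcher(dir);
        String back = searcher.doc(0).get("content");
        // true means storage preserved the character; look elsewhere.
        System.out.println(extB.equals(back));
        searcher.close();
    }
}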
Erik
On Feb 28, 2008, at 11:26 AM, Ken Krugler wrote:
Hi Christian,
The documents I am trying to index with Solr contain characters from CJK Extension B, which was added to Unicode in version 3.1 (March 2001). Unfortunately, it seems that Solr (and maybe Lucene) does not yet support these characters.

Solr seems to accept the documents without problem, but when I retrieve them, there are strange placeholders like #0; etc. in their place. Might this be a configuration issue?
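To see exactly what is coming back, it can help to dump the code points of the retrieved field value: intact Extension B text shows up as single code points at U+20000 and above, while mangled text shows replacement characters or leftover entity fragments. A minimal sketch (the sample value is just a stand-in for whatever Solr actually returns):

public class CodePointDump {
    public static void main(String[] args) {
        // Substitute the field value fetched back from Solr; this sample
        // is U+20000 so the snippet runs on its own.
        String value = new String(Character.toChars(0x20000));

        for (int i = 0; i < value.length(); ) {
            int cp = value.codePointAt(i);
            System.out.printf("U+%04X%n", cp);
            i += Character.charCount(cp);
        }
    }
}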
And as Erik mentioned, it appears that line 114 of StandardTokenizerImpl.jflex:

http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex

needs to be updated to include the Extension B character range.
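For reference, the Extension B block spans U+20000 through U+2A6DF, entirely above the Basic Multilingual Plane, so in Java these characters arrive as surrogate pairs and a 16-bit \uXXXX range in the grammar cannot match them directly. A quick illustration (the range constants come from the Unicode block definition; the rest is just a sketch):

public class ExtensionBCheck {
    // CJK Unified Ideographs Extension B block, per Unicode 3.1.
    static final int EXT_B_FIRST = 0x20000;
    static final int EXT_B_LAST = 0x2A6DF;

    static boolean isExtensionB(int codePoint) {
        return codePoint >= EXT_B_FIRST && codePoint <= EXT_B_LAST;
    }

    public static void main(String[] args) {
        String s = new String(Character.toChars(0x20000));
        System.out.println(s.length());                      // 2 chars: a surrogate pair
        System.out.println(s.codePointCount(0, s.length())); // 1 code point
        System.out.println(isExtensionB(s.codePointAt(0)));  // true
    }
}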
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"