Re: no support for CJK characters from Extension B in Solr

Christian Wittern Thu, 28 Feb 2008 15:56:39 -0800

Thanks to all for clearing this up. It seems we are still quite far away from full Unicode support :-(

As to the questions about the encoding in previous messages, all of the other characters in the documents come through without a glitch, so there is definitely no other issue involved.

Christian

Erik Hatcher wrote:

Wow - great stuff Steve!
As for StandardTokenizer and Java version - no worries there really, as Solr itself requires Java 1.5+, so when such a tokenizer is made available it could be used just fine in Solr even if it isn't built into a core Lucene release for a while.
    Erik



On Feb 28, 2008, at 12:08 PM, Steven A Rowe wrote:
On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
And as Erik mentioned, it appears that line 114 of
StandardTokenizerImpl.jflex:
http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
needs to be updated to include the Extension B character range.
JFlex 1.4.1 (the latest release) does not support supplementary codepoints (those above the BMP - Basic Multilingual Plane: [U+0000-U+FFFF]), and CJK Ideograph Extension B is definitely a supplementary range - see the first column from <http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt> (the extent of this range is unchanged through the latest [beta] version, 5.1.0):
    20000;<CJK Ideograph Extension B, First> ...
    2A6D6;<CJK Ideograph Extension B, Last> ...
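
To illustrate why this range is a problem for a scanner that works on 16-bit chars, here is a minimal Java 1.5+ sketch. The class name ExtensionBCheck and its helper methods are made up for this illustration and are not part of Lucene, Solr or JFlex; the range endpoints are the ones quoted above and may differ in later Unicode versions.

```java
public class ExtensionBCheck {
    // Range endpoints as quoted from UnicodeData above (assumed fixed for this sketch).
    static final int EXT_B_FIRST = 0x20000;
    static final int EXT_B_LAST = 0x2A6D6;

    // True if the code point falls inside the quoted Extension B range.
    static boolean isExtensionB(int codePoint) {
        return codePoint >= EXT_B_FIRST && codePoint <= EXT_B_LAST;
    }

    // Counts Extension B characters by walking the string one code point
    // at a time, rather than one 16-bit char at a time.
    static int countExtensionB(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isExtensionB(cp)) {
                count++;
            }
            i += Character.charCount(cp); // 2 for supplementary code points
        }
        return count;
    }

    public static void main(String[] args) {
        // The first Extension B ideograph, built from its code point.
        String first = new String(Character.toChars(EXT_B_FIRST));

        // One character, but two Java chars (a surrogate pair).
        System.out.println(first.length());
        System.out.println(Character.isSupplementaryCodePoint(EXT_B_FIRST));
        System.out.println(Character.isHighSurrogate(first.charAt(0)));

        // Mixed text: ASCII, one Extension B character, one BMP ideograph.
        System.out.println(countExtensionB("a" + first + "\u4E2D"));
    }
}
```

A tokenizer whose character classes only cover [U+0000-U+FFFF] sees the two surrogate chars individually, which is why the grammar cannot simply list the Extension B range until the scanner generator itself accepts supplementary code points.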
I am working with Gerwin Klein on the development version of JFlex, and am hoping to get Level 1 [Regular Expression] Basic Unicode Support into the next release (see <http://unicode.org/reports/tr18/>) - among other things, this entails accepting supplementary code points.
However, the next release of JFlex will require Java 1.5+, and Lucene 2.X requires Java 1.4, so until Lucene reaches release 3.0 and begins requiring Java 1.5 (and Solr incorporates it), JFlex support of supplementary code points is moot.
In short, it'll probably be at least a year before the StandardTokenizer can be modified to accept supplementary characters, given the processes involved.
Steve

--

Christian Wittern
Institute for Research in Humanities, Kyoto University

47 Higashiogura-cho, Kitashirakawa, Sakyo-ku, Kyoto 606-8265, JAPAN
