RE: no support for CJK characters from Extension B in Solr

Steven A Rowe Thu, 28 Feb 2008 09:09:28 -0800

On 02/28/2008 at 11:26 AM, Ken Krugler wrote:
> And as Erik mentioned, it appears that line 114 of
> StandardTokenizerImpl.jflex:
> 
> http://www.krugle.org/kse/files/svn/svn.apache.org/lucene/java/trunk/src/java/org/apache/lucene/analysis/standard/StandardTokenizerImpl.jflex
> 
> needs to be updated to include the Extension B character range.


JFlex 1.4.1 (the latest release) does not support supplementary code points 
(those above the BMP, the Basic Multilingual Plane: [U+0000-U+FFFF]), and CJK 
Ideograph Extension B lies entirely in the supplementary range. See the first 
column of <http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0.txt> 
(the extent of this range is unchanged through the latest [beta] version, 
5.1.0):

        20000;<CJK Ideograph Extension B, First> ...
        2A6D6;<CJK Ideograph Extension B, Last> ...
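
To make the problem concrete, here is a minimal sketch (not code from Lucene or JFlex) of why these code points fall outside what a 16-bit, char-based scanner can match. The class name ExtensionBDemo and its helper methods are hypothetical, and the code-point methods on Character and String that it relies on require Java 1.5 or later:

```java
class ExtensionBDemo {
    // Range endpoints taken from the two UnicodeData lines quoted above.
    static final int EXT_B_FIRST = 0x20000;
    static final int EXT_B_LAST = 0x2A6D6;

    /** True if the code point lies in CJK Ideograph Extension B. */
    static boolean isExtensionB(int codePoint) {
        return codePoint >= EXT_B_FIRST && codePoint <= EXT_B_LAST;
    }

    /** Counts Extension B code points, stepping by code point rather than by char. */
    static int countExtensionB(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isExtensionB(cp)) {
                count++;
            }
            // A supplementary code point occupies two chars (a surrogate pair).
            i += Character.charCount(cp);
        }
        return count;
    }

    public static void main(String[] args) {
        String first = new String(Character.toChars(EXT_B_FIRST));

        // U+20000 is above U+FFFF, so it is a supplementary code point.
        System.out.println(Character.isSupplementaryCodePoint(EXT_B_FIRST)); // true

        // In UTF-16 it is stored as two chars, neither of which is in [U+20000-U+2A6D6].
        System.out.println(first.length()); // 2
        System.out.println(Integer.toHexString(first.charAt(0)) + " "
                + Integer.toHexString(first.charAt(1))); // d840 dc00

        // Only code-point iteration sees the single Extension B character.
        System.out.println(countExtensionB("a" + first + "\u4e2d")); // 1
    }
}
```

A scanner that examines one 16-bit char at a time sees only the surrogates D840 and DC00 here, so a character class written in terms of U+20000-U+2A6D6 has nothing to match against.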

I am working with Gerwin Klein on the development version of JFlex, and am 
hoping to get Level 1 [Regular Expression] Basic Unicode Support into the next 
release (see <http://unicode.org/reports/tr18/>) - among other things, this 
entails accepting supplementary code points. 

However, the next release of JFlex will require Java 1.5+, and Lucene 2.X 
requires Java 1.4, so until Lucene reaches release 3.0 and begins requiring 
Java 1.5 (and Solr incorporates it), JFlex support of supplementary code points 
is moot.

In short, it'll probably be at least a year before the StandardTokenizer can be 
modified to accept supplementary characters, given the processes involved.

Steve
