On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir <rcm...@gmail.com> wrote:

> (Jonathan, I apologize for emailing you twice, i meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind <rochk...@jhu.edu>
> wrote:
> >
> > Wait, StandardTokenizer already handles CJK and will put each CJK char
> > into its own token? Really? I had no idea! Is that documented anywhere,
> > or do you just have to look at the source to see it?
> >
>
> Yes, you are right, the documentation should have been more explicit:
> in previous releases it says nothing about how CJK is tokenized. But
> StandardTokenizer does split CJK this way, one character per token, and
> tags those tokens with the "CJ" token type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
>  * As of Lucene version 3.1, this class implements the Word Break rules
>  * from the Unicode Text Segmentation algorithm, as specified in
>  * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
> (from
> http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java
> )
>
> So you can read the UAX#29 report and then you know how it tokenizes
> text. You can also just use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")
>

What does this mean for those of us on Solr 1.4 and Lucene 2.9.3? Does the
current stable StandardTokenizer handle CJK the same way?
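
One way to check what your exact version does is to run a few CJK
characters through the tokenizer and print the token types. Here's a
minimal sketch, assuming Lucene 2.9.x and its attribute-based TokenStream
API (the class name and sample string are just for illustration):

    import java.io.StringReader;

    import org.apache.lucene.analysis.standard.StandardTokenizer;
    import org.apache.lucene.analysis.tokenattributes.TermAttribute;
    import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
    import org.apache.lucene.util.Version;

    // Hypothetical test class, just for illustration.
    public class CjkTokenCheck {
      public static void main(String[] args) throws Exception {
        // Feed a short CJK string through StandardTokenizer and print
        // each token alongside its token type.
        StandardTokenizer ts = new StandardTokenizer(Version.LUCENE_29,
            new StringReader("日本語"));
        TermAttribute term = ts.addAttribute(TermAttribute.class);
        TypeAttribute type = ts.addAttribute(TypeAttribute.class);
        ts.reset(); // a no-op on 2.9, but required on later versions
        while (ts.incrementToken()) {
          System.out.println(term.term() + "\t" + type.type());
        }
        ts.end();
        ts.close();
      }
    }

If each character comes back as its own token tagged <CJ>, your version
splits CJK the way Robert describes.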

-- 
Jacob Elder
@jelder
(646) 535-3379
