On Wed, Dec 1, 2010 at 11:01 AM, Robert Muir <rcm...@gmail.com> wrote:
> (Jonathan, I apologize for emailing you twice, I meant to hit reply-all)
>
> On Wed, Dec 1, 2010 at 10:49 AM, Jonathan Rochkind <rochk...@jhu.edu> wrote:
>>
>> Wait, StandardTokenizer already handles CJK and will put each CJK char into
>> its own token? Really? I had no idea! Is that documented anywhere, or do you
>> just have to look at the source to see it?
>
> Yes, you are right, the documentation should have been more explicit: in
> previous releases it says nothing about how CJK is tokenized. But it does
> tokenize CJK this way, tagging those tokens with the "CJ" token type.
>
> I think the documentation issue is "fixed" in branch_3x and trunk:
>
> * As of Lucene version 3.1, this class implements the Word Break rules from the
> * Unicode Text Segmentation algorithm, as specified in
> * <a href="http://unicode.org/reports/tr29/">Unicode Standard Annex #29</a>.
>
> (from http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/standard/StandardTokenizer.java)
>
> So you can read the UAX#29 report to see exactly how it tokenizes text.
> You can also use this demo app to see how the new one works:
> http://unicode.org/cldr/utility/breaks.jsp (choose "Word")

What does this mean to those of us on Solr 1.4 and Lucene 2.9.3? Does the current stable StandardTokenizer handle CJK?

--
Jacob Elder
@jelder
(646) 535-3379
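For anyone who wants to check the behavior locally rather than reading the spec, here is a minimal sketch that runs StandardTokenizer over mixed CJK/Latin text and prints each token with the type the tokenizer assigned to it. It assumes the Lucene 3.1 attribute-based TokenStream API (on 2.9.x you would use TermAttribute instead of CharTermAttribute); the class name CjkTokenDemo and the sample string are made up for illustration.

import java.io.StringReader;

import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.util.Version;

public class CjkTokenDemo {
  public static void main(String[] args) throws Exception {
    // "Tokyo" in Japanese followed by an English word, so the output
    // shows how CJK and Latin text are tokenized side by side.
    String text = "東京 search";

    StandardTokenizer tokenizer =
        new StandardTokenizer(Version.LUCENE_31, new StringReader(text));
    CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
    TypeAttribute type = tokenizer.addAttribute(TypeAttribute.class);

    tokenizer.reset();
    while (tokenizer.incrementToken()) {
      // Prints each token alongside whatever type StandardTokenizer
      // assigned it in this Lucene version.
      System.out.println(term.toString() + "\t" + type.type());
    }
    tokenizer.end();
    tokenizer.close();
  }
}

Running this against different Lucene versions makes the thread's point concrete: the tokens and their type labels come straight from the tokenizer, so you can see directly whether your version splits each CJK character into its own token and what type it tags them with.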