[ https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447730#comment-17447730 ]
Robert Muir commented on LUCENE-10243:
--------------------------------------

OK, I looked at this in more detail. Bumped to 10, tests pass. Bumped to 11, tests fail.

We just have to upgrade the grammar to reflect some changes to UAX#29 word breaks in Unicode 11 and Unicode 12:

http://www.unicode.org/reports/tr29/tr29-33.html#Modifications
* Added use of Extended_Pictographic from Emoji 11.0 (UTS #51) instead of from CLDR, simplified the rules by obsoleting E_Base, Glue_After_Zwj, E_Base_GAZ, and E_Modifier, merging E_Modifier into Extend.
* Added rule for not breaking within sequences of horizontal whitespace.
* Merged E_Modifier into Extend, and removed WB14.

http://www.unicode.org/reports/tr29/tr29-35.html#Modifications
* Added U+FF10..U+FF19 to Numeric

Some of this stuff (such as full-width numerics and extended pictographic usage), we were already doing. So hopefully it all just leads to simplifications :)

> increase unicode versions of tokenizers to unicode 12.1
> -------------------------------------------------------
>
>                 Key: LUCENE-10243
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10243
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Followup from LUCENE-10239
> Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is
> the most recent supported by the jflex release.
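
Two of the word-break changes listed in the comment (full-width digits U+FF10..U+FF19 counting as Numeric, and no break inside a run of horizontal whitespace) can be illustrated with a small toy segmenter. This is only a sketch under stated assumptions, not Lucene's JFlex grammar and not a full UAX#29 implementation: `word_break_class` and `toy_segments` are hypothetical names, the Numeric class is reduced to ASCII and full-width digits, the whitespace class is approximated from the general category Zs, and every other character (letters included) is simply emitted on its own.

```python
import unicodedata

# Assumption: a hand-picked subset of Numeric, just ASCII and full-width digits.
FULLWIDTH_DIGITS = range(0xFF10, 0xFF1A)  # U+FF10..U+FF19

# Assumption: horizontal whitespace is approximated as general category Zs
# minus the no-break spaces listed here.
NO_BREAK_SPACES = (0x00A0, 0x2007, 0x202F)


def word_break_class(ch):
    """Return a simplified class name for a single character."""
    cp = ord(ch)
    if ch in "0123456789" or cp in FULLWIDTH_DIGITS:
        return "Numeric"
    if unicodedata.category(ch) == "Zs" and cp not in NO_BREAK_SPACES:
        return "WSegSpace"
    return "Other"


def toy_segments(text):
    """Split text, never breaking Numeric x Numeric or WSegSpace x WSegSpace.

    Characters in the "Other" class are each emitted as their own segment;
    the real rules for letters, Extend, emoji, etc. are deliberately left out.
    """
    segments = []  # list of (class, string) pairs
    for ch in text:
        cls = word_break_class(ch)
        if segments and cls != "Other" and segments[-1][0] == cls:
            # Same joinable class as the previous segment: extend it.
            segments[-1] = (cls, segments[-1][1] + ch)
        else:
            segments.append((cls, ch))
    return [s for _, s in segments]


if __name__ == "__main__":
    # Full-width digits one and two, then two spaces, between two letters.
    print(toy_segments("a\uff11\uff12  b"))
```

With these simplifications, a run of full-width digits stays in one segment, and so does a run of spaces, which is the behaviour the two rule changes describe.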