[jira] [Commented] (LUCENE-10243) increase unicode versions of tokenizers to unicode 12.1

Robert Muir (Jira) Mon, 22 Nov 2021 18:30:05 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447730#comment-17447730
 ]


Robert Muir commented on LUCENE-10243:
--------------------------------------

OK, I looked at this in more detail. Bumped the Unicode version to 10 and the 
tests pass. Bumped it to 11 and the tests fail.

We just have to upgrade the grammar to reflect some changes to UAX#29 word 
breaks in Unicode 11 and Unicode 12:

http://www.unicode.org/reports/tr29/tr29-33.html#Modifications
* Added use of Extended_Pictographic from Emoji 11.0 (UTS #51) instead of from 
CLDR, simplified the rules by obsoleting E_Base, Glue_After_Zwj, E_Base_GAZ, 
and E_Modifier, merging E_Modifier into Extend.
* Added rule for not breaking within sequences of horizontal whitespace.
* Merged E_Modifier into Extend, and removed WB14.

http://www.unicode.org/reports/tr29/tr29-35.html#Modifications
* Added U+FF10..U+FF19 to Numeric

We were already doing some of this (such as full-width numerics and 
Extended_Pictographic usage), so hopefully it all just leads to 
simplifications :)
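
As a rough illustration of the full-width numerics point (not part of any 
patch here): the sketch below walks U+FF10..U+FF19 and checks that the JDK 
reports each code point as a decimal digit with the expected value. The class 
name FullWidthDigits and its helper methods are made up for this example. 
Note that the JDK exposes the general category, not the UAX#29 Word_Break 
property, so this only suggests why treating these code points as Numeric is 
natural. It says nothing about what the tokenizer grammar actually does.

```java
public class FullWidthDigits {
    // First and last code points of the range named in the UAX#29 change note.
    static final int FIRST = 0xFF10; // FULLWIDTH DIGIT ZERO
    static final int LAST = 0xFF19;  // FULLWIDTH DIGIT NINE

    /** Returns the decimal value (0-9) of a full-width digit, or -1 if cp is outside the range. */
    static int decimalValue(int cp) {
        if (cp < FIRST || cp > LAST) {
            return -1;
        }
        return Character.digit(cp, 10);
    }

    /** True if the JDK treats every code point in U+FF10..U+FF19 as a digit with the matching value. */
    static boolean allAreDigits() {
        for (int cp = FIRST; cp <= LAST; cp++) {
            if (!Character.isDigit(cp) || Character.digit(cp, 10) != cp - FIRST) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Build the ten full-width digits as a string and report the check result.
        StringBuilder sb = new StringBuilder();
        for (int cp = FIRST; cp <= LAST; cp++) {
            sb.appendCodePoint(cp);
        }
        System.out.println(sb + " allAreDigits=" + allAreDigits());
    }
}
```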

> increase unicode versions of tokenizers to unicode 12.1
> -------------------------------------------------------
>
>                 Key: LUCENE-10243
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10243
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Followup from LUCENE-10239
> Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is 
> the most recent supported by the jflex release.



