[jira] [Commented] (LUCENE-10243) increase unicode versions of tokenizers to unicode 12.1

Robert Muir (Jira) Mon, 22 Nov 2021 19:12:05 -0800


    [ 
https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447741#comment-17447741
 ]


Robert Muir commented on LUCENE-10243:
--------------------------------------

OK, I see my main problem with the generated conformance tests. The tests 
indicate expected breaks, and we convert those into expected tokens (using the 
perl script, which relies on other Unicode data files).
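
As a rough illustration of the "expected breaks" part, here is a minimal 
Python sketch (not the actual perl script) that splits one line in the 
WordBreakTest.txt layout into segments. The function name 
segments_from_test_line is hypothetical. Deciding which segments count as 
tokens is the part that needs the other Unicode data files, and it is left 
out here.

```python
BREAK = "\u00f7"     # the division sign: a break is expected at this position
NO_BREAK = "\u00d7"  # the multiplication sign: no break at this position


def segments_from_test_line(line):
    """Split one break-test line into the text segments between breaks.

    Hypothetical helper for illustration; a line looks like
    '<BREAK> 0061 <NO_BREAK> 0062 <BREAK> 0020 <BREAK>  # comment'.
    """
    fields = line.split("#", 1)[0].split()  # drop the trailing comment
    segments = []
    current = []
    for field in fields:
        if field == BREAK:
            # A break closes the segment collected so far, if any.
            if current:
                segments.append("".join(current))
                current = []
        elif field == NO_BREAK:
            # No break: keep appending to the current segment.
            continue
        else:
            # Everything else is a hex code point.
            current.append(chr(int(field, 16)))
    if current:
        segments.append("".join(current))
    return segments
```

A tokenizer test would then still have to drop segments that are not tokens 
(whitespace, punctuation, and so on).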

But as of Unicode 11, WordBreakProperty.txt no longer contains the old E_Base, 
etc. properties. Instead, the perl script needs to pull Extended_Pictographic 
from emoji-data.txt.
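
For illustration only, pulling one property out of a file in the 
emoji-data.txt layout could look like the Python sketch below. This is not 
the perl script from the issue. The helper names parse_property_ranges and 
has_property are hypothetical, and SAMPLE is a few hand-written lines in the 
same layout, not the real file.

```python
def parse_property_ranges(lines, prop):
    """Collect (first, last) code point ranges that carry the property `prop`.

    Assumes the usual UCD layout: 'XXXX[..YYYY] ; Property # comment'.
    """
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        start, _, end = fields[0].partition("..")
        first = int(start, 16)
        last = int(end, 16) if end else first  # a single code point is a 1-long range
        ranges.append((first, last))
    return sorted(ranges)


def has_property(ranges, code_point):
    """Return True if code_point falls inside any collected range."""
    return any(first <= code_point <= last for first, last in ranges)


# Hand-written sample lines in the emoji-data.txt layout (an assumption for
# this sketch, not a copy of the real file).
SAMPLE = [
    "# sample data",
    "0023          ; Emoji                # number sign",
    "00A9          ; Extended_Pictographic# copyright",
    "1F170..1F171  ; Extended_Pictographic# A button..B button",
]

pictographic = parse_property_ranges(SAMPLE, "Extended_Pictographic")
```

With the sample above, has_property(pictographic, 0x1F171) is true and 
has_property(pictographic, 0x23) is false.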

(cracks knuckles and prepares to fight perl)

> increase unicode versions of tokenizers to unicode 12.1
> -------------------------------------------------------
>
>                 Key: LUCENE-10243
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10243
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Followup from LUCENE-10239
> Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is 
> the most recent supported by the jflex release.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
