[ https://issues.apache.org/jira/browse/LUCENE-10243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447741#comment-17447741 ]
Robert Muir commented on LUCENE-10243:
--------------------------------------

OK, I see my main problem with the generated conformance tests. The tests indicate expected breaks, and we convert those into expected tokens (with the Perl script, based on other Unicode data files).

But as of Unicode 11, WordBreakProperty.txt no longer contains the old E_Base, etc. properties. Instead, we need to pull Extended_Pictographic from emoji-data.txt in the Perl script.

(cracks knuckles and prepares to fight perl)

> increase unicode versions of tokenizers to unicode 12.1
> -------------------------------------------------------
>
>                 Key: LUCENE-10243
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10243
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Robert Muir
>            Priority: Major
>
> Followup from LUCENE-10239
> Bump the Unicode version of these tokenizers from Unicode 9 to 12.1, which is
> the most recent supported by the jflex release.
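
For illustration only, here is a minimal Python sketch of the step the comment describes: reading the Extended_Pictographic code point ranges out of a file in the emoji-data.txt format. It is not the Perl script used by Lucene. The function name `parse_property_ranges` and the abbreviated `SAMPLE` data are assumptions made for this example; a real run would read the full emoji-data.txt for the target Unicode version.

```python
from typing import Iterable, List, Tuple


def parse_property_ranges(lines: Iterable[str], prop: str) -> List[Tuple[int, int]]:
    """Collect inclusive (first, last) code point ranges tagged with `prop`.

    Assumes the usual UCD-style layout: `XXXX[..YYYY] ; Property # comment`.
    (Hypothetical helper, not part of Lucene or its build scripts.)
    """
    ranges = []
    for line in lines:
        # Drop the trailing comment, then skip blank and comment-only lines.
        data = line.split("#", 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(";")]
        if len(fields) < 2 or fields[1] != prop:
            continue
        # A single code point has no "..", so `last` is empty in that case.
        first, _, last = fields[0].partition("..")
        lo = int(first, 16)
        hi = int(last, 16) if last else lo
        ranges.append((lo, hi))
    return ranges


# Abbreviated sample in the emoji-data.txt layout (illustrative, not the full file).
SAMPLE = """\
# sample data
0023          ; Emoji                # number sign
00A9          ; Extended_Pictographic# copyright
231A..231B    ; Extended_Pictographic# watch..hourglass done
1F600         ; Emoji_Presentation   # grinning face
"""

ranges = parse_property_ranges(SAMPLE.splitlines(), "Extended_Pictographic")
for lo, hi in ranges:
    print(f"{lo:04X}..{hi:04X}")
```

The ranges collected this way would stand in for the E_Base-style properties that WordBreakProperty.txt stopped listing, when turning the expected breaks in the conformance tests into expected tokens.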