[ https://issues.apache.org/jira/browse/MNG-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Elliotte Rusty Harold resolved MNG-8241.
----------------------------------------
    Fix Version/s: 4.0.0-rc-4
       Resolution: Fixed

> ComparableVersion incorrectly handles Unicode non-BMP characters
> ----------------------------------------------------------------
>
>                 Key: MNG-8241
>                 URL: https://issues.apache.org/jira/browse/MNG-8241
>             Project: Maven
>          Issue Type: Bug
>            Reporter: Matthew Donoughe
>            Assignee: Elliotte Rusty Harold
>            Priority: Minor
>             Fix For: 4.0.0-rc-4
>
>
> Java strings are (usually) Unicode, but Java chars are a subset of Unicode.
> ComparableVersion makes heavy use of
> [String.charAt|https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html#charAt(int)],
> which will return surrogate values instead of Unicode code points whenever a
> character takes more than 16 bits.
> This leads to the following behavior:
>
> {noformat}
> java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 1 𝟤
> Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
> 1. 1 -> 1; tokens: [1]
>    1 > 𝟤
> 2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
> 1 (DIGIT ONE) > 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO) because
> ComparableVersion sees 𝟤 as two invalid characters and treats it as text.
>
>
> {noformat}
> java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 0 𝟤
> Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
> 1. 0 -> ; tokens: []
>    0 < 𝟤
> 2. 𝟤 -> 𝟤; tokens: [𝟤]{noformat}
> However, 0 (DIGIT 0) is still < 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO).
> 0 < 𝟤 < 1 the same way 0 < a < 1.
>
> It's unclear whether this should be considered to be a bug or whether it's
> just an undocumented limitation. String.charAt and String.length should be
> avoided unless you can be sure the characters are all BMP (Basic Multilingual
> Plane).
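
To make the charAt behavior described above concrete, here is a minimal sketch using only standard java.lang calls. The class name SurrogateDemo and its helper methods are made up for illustration and are not part of Maven. It shows that 𝟤 (U+1D7E4) occupies two chars, that charAt(0) yields a surrogate which is not a digit, and that codePointAt(0) yields a code point which is.

```java
public class SurrogateDemo {
    // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO, built from its code point
    // so the example does not depend on the source file encoding.
    static final String SANS_TWO = new String(Character.toChars(0x1D7E4));

    /** Number of UTF-16 code units, which is what String.length() reports. */
    static int charLength(String s) {
        return s.length();
    }

    /** Number of Unicode code points in the string. */
    static int codePointLength(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        char first = SANS_TWO.charAt(0);          // high surrogate, not a character
        int cp = SANS_TWO.codePointAt(0);         // the full code point U+1D7E4

        System.out.println(charLength(SANS_TWO));             // 2
        System.out.println(codePointLength(SANS_TWO));        // 1
        System.out.println(Character.isHighSurrogate(first)); // true
        System.out.println(Character.isDigit(first));         // false
        System.out.println(Character.isDigit(cp));            // true
        System.out.println(Character.digit(cp, 10));          // 2
    }
}
```

The int overloads of Character.isDigit and Character.digit accept supplementary code points, which is why the last two checks succeed where the char-based check does not.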
> I was initially worried that 𝟣𝟣𝟣𝟣𝟣 (MATHEMATICAL SANS-SERIF DIGIT ONE) > 22222
> (DIGIT TWO) because "𝟣𝟣𝟣𝟣𝟣".length is 10, greater than
> MAX_INTITEM_LENGTH, but that code doesn't even get hit because String.charAt
> is producing effectively "�����������". If the code is changed to identify
> non-BMP Nd class digits like 𝟣 as digits then the code that determines the
> required size of the data type needs to be updated to measure the length in
> code points instead of chars.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
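
The report's closing point, that digit runs would have to be measured in code points rather than chars, can be sketched as follows. This is an illustration only and not the fix applied in Maven: the class CodePointDigits and all of its methods are hypothetical, and the value 9 for MAX_INTITEM_LENGTH is an assumption made for the example.

```java
public class CodePointDigits {
    // Assumed limit for illustration; the report names the constant but not its value.
    static final int MAX_INTITEM_LENGTH = 9;

    /** Hypothetical helper: builds a string of n copies of one code point. */
    static String repeatCodePoint(int codePoint, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < n; i++) {
            sb.appendCodePoint(codePoint);
        }
        return sb.toString();
    }

    /** Length in code points, so a non-BMP digit counts once, not twice. */
    static int digitCount(String s) {
        return s.codePointCount(0, s.length());
    }

    /**
     * Converts a run of Unicode decimal digits (class Nd) to ASCII digits.
     * Returns null if any code point is not a decimal digit.
     */
    static String toAsciiDigits(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!Character.isDigit(cp)) {
                return null;
            }
            sb.append((char) ('0' + Character.digit(cp, 10)));
            i += Character.charCount(cp); // advance by 1 or 2 chars
        }
        return sb.toString();
    }

    /** Size check based on code points rather than String.length(). */
    static boolean fitsInInt(String s) {
        return digitCount(s) <= MAX_INTITEM_LENGTH;
    }

    public static void main(String[] args) {
        // Five copies of U+1D7E3 MATHEMATICAL SANS-SERIF DIGIT ONE
        String ones = repeatCodePoint(0x1D7E3, 5);
        System.out.println(ones.length());       // 10
        System.out.println(digitCount(ones));    // 5
        System.out.println(toAsciiDigits(ones)); // 11111
        System.out.println(fitsInInt(ones));     // true
    }
}
```

With a char-based length the five-digit string would measure 10 and exceed the assumed limit, which is the mismatch the report warns about.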