[ https://issues.apache.org/jira/browse/MNG-8241?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Elliotte Rusty Harold resolved MNG-8241.
----------------------------------------
    Fix Version/s: 4.0.0-rc-4
       Resolution: Fixed

> ComparableVersion incorrectly handles Unicode non-BMP characters
> ----------------------------------------------------------------
>
>                 Key: MNG-8241
>                 URL: https://issues.apache.org/jira/browse/MNG-8241
>             Project: Maven
>          Issue Type: Bug
>            Reporter: Matthew Donoughe
>            Assignee: Elliotte Rusty Harold
>            Priority: Minor
>             Fix For: 4.0.0-rc-4
>
>
> Java strings are sequences of UTF-16 code units, and a Java char is a single 
> 16-bit code unit, so one char cannot represent every Unicode code point. 
> ComparableVersion makes heavy use of 
> [String.charAt|https://docs.oracle.com/en/java/javase/22/docs/api/java.base/java/lang/String.html#charAt(int)],
>  which returns a surrogate code unit instead of a full code point whenever a 
> character lies outside the BMP and therefore takes more than 16 bits.
> This leads to the following behavior:
>  
> {noformat}
> java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 1 𝟤
> Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
> 1. 1 -> 1; tokens: [1]
>    1 > 𝟤
> 2. 𝟤 -> 𝟤; tokens: [𝟤]
> {noformat}
> 1 (DIGIT ONE) > 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO) because 
> ComparableVersion sees 𝟤 as two surrogate chars, neither of which is a 
> digit, and so treats it as text.
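
The difference between scanning by char and scanning by code point can be reproduced outside Maven. The sketch below is not the ComparableVersion code; the class and method names are hypothetical, and it only assumes that a digit test is applied either to each UTF-16 code unit or to each code point.

```java
final class SurrogateDigitDemo {
    // U+1D7E4 MATHEMATICAL SANS-SERIF DIGIT TWO, built from its code point
    // so the example does not depend on the source-file encoding.
    static final String SANS_TWO = new String(Character.toChars(0x1D7E4));

    // char-based scan (hypothetical helper): every UTF-16 code unit must be
    // a digit on its own, which is what a charAt loop effectively tests.
    static boolean allDigitsByChar(String s) {
        if (s.isEmpty()) {
            return false;
        }
        for (int i = 0; i < s.length(); i++) {
            if (!Character.isDigit(s.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    // Code-point-based scan (hypothetical helper): surrogate pairs are
    // combined into one code point before the digit test is applied.
    static boolean allDigitsByCodePoint(String s) {
        return !s.isEmpty() && s.codePoints().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        // One code point, but two chars (a high and a low surrogate).
        System.out.println(SANS_TWO.length());                              // 2
        System.out.println(Character.isHighSurrogate(SANS_TWO.charAt(0)));  // true
        System.out.println(allDigitsByChar(SANS_TWO));                      // false
        System.out.println(allDigitsByCodePoint(SANS_TWO));                 // true
        System.out.println(Character.digit(SANS_TWO.codePointAt(0), 10));   // 2
    }
}
```

A parser that tests chars one at a time can never classify 𝟤 as a digit, which is consistent with the token being handled as text in the output above.
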
>  
>  
> {noformat}
> java -jar ~/.m2/repository/org/apache/maven/maven-artifact/3.9.4/maven-artifact-3.9.4.jar 0 𝟤
> Display parameters as parsed by Maven (in canonical form and as a list of tokens) and comparison result:
> 1. 0 -> ; tokens: []
>    0 < 𝟤
> 2. 𝟤 -> 𝟤; tokens: [𝟤]
> {noformat}
> However, 0 (DIGIT ZERO) is still < 𝟤 (MATHEMATICAL SANS-SERIF DIGIT TWO): 
> 0 < 𝟤 < 1, in the same way that 0 < a < 1.
>  
> It's unclear whether this should be considered to be a bug or whether it's 
> just an undocumented limitation. String.charAt and String.length should be 
> avoided unless you can be sure the characters are all BMP (Basic Multilingual 
> Plane).
> I was initially worried that 𝟣𝟣𝟣𝟣𝟣 (five MATHEMATICAL SANS-SERIF DIGIT ONE 
> characters) would compare greater than 22222 (five DIGIT TWO characters) 
> because "𝟣𝟣𝟣𝟣𝟣".length() is 10, greater than MAX_INTITEM_LENGTH, but that 
> code doesn't even get hit, because String.charAt is effectively producing 
> ten unpaired surrogates, none of which is treated as a digit. If the code is 
> changed to identify non-BMP Nd class digits like 𝟣 as digits, then the code 
> that determines the required size of the data type needs to be updated to 
> measure the length in code points instead of chars.
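
To illustrate the last point, here is a minimal sketch of measuring a digit run in code points and mapping Nd digits to ASCII before deciding on a data type. It is not the fix that was committed for this issue. The class, the helper names, and the constant MAX_INT_DIGITS are hypothetical; the value 9 is an assumption standing in for MAX_INTITEM_LENGTH. String.repeat requires Java 11 or later.

```java
final class CodePointLengthDemo {
    // Assumed limit on how many decimal digits fit an int-sized item.
    // Hypothetical stand-in for MAX_INTITEM_LENGTH; the value is an assumption.
    static final int MAX_INT_DIGITS = 9;

    // Length in Unicode code points rather than UTF-16 code units.
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    // Maps each Unicode decimal digit to its ASCII equivalent and rejects
    // any code point that is not a decimal digit.
    static String toAsciiDigits(String s) {
        StringBuilder out = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            int value = Character.digit(cp, 10);
            if (value < 0) {
                throw new NumberFormatException(
                        "not a decimal digit: U+" + Integer.toHexString(cp).toUpperCase());
            }
            out.append((char) ('0' + value));
        });
        return out.toString();
    }

    // Size check based on code points, so non-BMP digits count once each.
    static boolean fitsInInt(String digits) {
        return lengthInCodePoints(digits) <= MAX_INT_DIGITS;
    }

    public static void main(String[] args) {
        // U+1D7E3 MATHEMATICAL SANS-SERIF DIGIT ONE, repeated five times.
        String fiveOnes = new String(Character.toChars(0x1D7E3)).repeat(5);

        System.out.println(fiveOnes.length());              // 10 chars
        System.out.println(lengthInCodePoints(fiveOnes));   // 5 code points
        System.out.println(fitsInInt(fiveOnes));            // true
        System.out.println(Integer.parseInt(toAsciiDigits(fiveOnes)) < 22222); // true
    }
}
```

With a char-based length, the five-digit value would be measured as ten units and could be pushed into a wider type than it needs; counting code points avoids that.
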



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
