This is an automated email from the ASF dual-hosted git repository.

ggregory pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/commons-lang.git


The following commit(s) were added to refs/heads/master by this push:
     new 665f047e5 [StringUtils::indexOfAnyBut] redesign due to 
inconsistent/faulty behaviour regarding UTF-16 surrogates (#1327)
665f047e5 is described below

commit 665f047e552ad71c189582af15a0b697133fff0b
Author: IBue <bc...@arcor.de>
AuthorDate: Mon Jan 6 23:22:42 2025 +0100

    [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty behaviour 
regarding UTF-16 surrogates (#1327)
    
    * [StringUtils::indexOfAnyBut] redesign due to inconsistent/faulty…
    …behaviour regarding UTF-16 surrogates
    
    Both signatures of StringUtils::indexOfAnyBut currently behave
    inconsistently in matching UTF-16 supplementary characters and single
    UTF-16 surrogate characters (i.e. paired and unpaired surrogates), since
    they differ unnecessarily in their algorithmic implementations, use
    their own incomplete and faulty interpretation of UTF-16 and don't take
    full advantage of the standard library.
    
    The example cases below show that they may yield contradictory results
    or correct results for the wrong reasons.
    
    This proposal gives a unified algorithmic implementation of both
    signatures that
    a) is much easier to grasp due to a clear mathematical set approach and
       safe iteration and doesn't become entangled in index arithmetic;
       stresses the set semantics of the 2nd argument
    b) fully relies on the standard library for defined UTF-16
       handling/interpretation;
       paired surrogates are merged into one codepoint, unpaired surrogates
       are left as they are
    c) scales much better with input sizes and result index position
    d) can benefit from current and future improvements in the standard
       library and JVM
       (streams implementation, parallelization, JIT optimization, JEP 218,
       ???…)
    
    The algorithm boils down to:
    find index i of first char in cs such that
    (cs.codePointAt(i) ∈ {x ∈ codepoints(cs) ∣ x ∉
    codepoints(searchChars) })
    
    Examples:
    ---------
    
    <H>: high-surrogate character
    <L>: low-surrogate character
    (<H><L>): valid supplementary character
    signature 1: StringUtils::indexOfAnyBut(final CharSequence seq,
    final CharSequence searchChars)
    signature 2: StringUtils::indexOfAnyBut(final CharSequence cs,
    final char... searchChars)
    
    Case 1: matching of unpaired high-surrogate
    ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---
    
     1.1     <H>aaaa      <H>abcd          !found       !found      !found
      sig.2: 'a' happens to follow <H> in searchChars;
      sig.1: 'a' is somewhere in searchChars
    
     1.2     <H>baaa      <H>abcd          !found       !found      0
      sig.1: 'b' is somewhere in searchChars
    
     1.3     <H>aaaa      (<H><L>)abcd     0            !found      0
      sig.1: 'a' is somewhere in searchChars
    
     1.4     aaaa<H>      (<H><L>)abcd     4            !found      !found
      sig.1+2 don't interpret suppl. character
    
    Case 2: matching of unpaired low-surrogate
    ---------seq/cs-------searchChars------exp./new-----sig.1-------sig.2---
    
     2.1     <L>aaaa      (<H><L>)abcd     0            !found      !found
      sig.1+2 don't interpret suppl. character
    
     2.2     aaaa<L>      (<H><L>)abcd     4            !found      !found
      sig.1+2 don't interpret suppl. character
    
    Case 3: matching of supplementary character
    ---------seq/cs-------------searchChars-----exp./new----sig.1-----sig.2-
    
     3.1     (<H><L>)aaaa       <L>ab<H>cd      0           !found    0
      sig.1: <L> is somewhere in searchChars
    
     3.2     (<H><L>)aaaa       abcd            0           1         0
      sig.1 always points to low-surrogate of (fully) unmatched
      suppl. character
    
     3.3     (<H><L>)aaaa       abcd<H>         0           0         1
     3.4     (<H><L>)aaaa       abcd<L>         0           !found    0
      sig.1: <H> skipped by algorithm
    
    * [StringUtils::indexOfAnyBut] further reduction of algorithm
    
    by simplifying set consideration:
    find index i of first char in seq such that (seq.codePointAt(i) ∉ { x ∈
    codepoints(searchChars) })
    
    * [StringUtils::indexOfAnyBut] simplify input-sequence iteration
    
    by transforming ListIterator loop into index-based loop,
    advancing by Character.charCount(codepoint);
    enabling short-circuit processing, avoiding full in-advance processing of
    input-sequence
    
    * [StringUtils:indexOfAnyBut] parameterization of test functions
    
    providing a single source-of-truth (arguments stream) for the two
    function variants
    
    * [StringUtils:indexOfAnyBut] remove comment
    
    Set::contains of immutable Set has unclear desastrous performance issues
    when searching for large values (here: >0xffff) in a set of smaller
    values (including JDK 23)
    
    ---------
    
    Co-authored-by: IBue <>
---
 .../java/org/apache/commons/lang3/StringUtils.java |  7 +-
 .../lang3/StringUtilsEqualsIndexOfTest.java        | 75 +++++++++++-----------
 2 files changed, 42 insertions(+), 40 deletions(-)

diff --git a/src/main/java/org/apache/commons/lang3/StringUtils.java 
b/src/main/java/org/apache/commons/lang3/StringUtils.java
index 8ee653375..25153c9ca 100644
--- a/src/main/java/org/apache/commons/lang3/StringUtils.java
+++ b/src/main/java/org/apache/commons/lang3/StringUtils.java
@@ -26,8 +26,10 @@ import java.util.Iterator;
 import java.util.List;
 import java.util.Locale;
 import java.util.Objects;
+import java.util.Set;
 import java.util.function.Supplier;
 import java.util.regex.Pattern;
+import java.util.stream.Collectors;
 
 import org.apache.commons.lang3.function.Suppliers;
 import org.apache.commons.lang3.stream.LangCollectors;
@@ -2853,11 +2855,12 @@ public class StringUtils {
         if (isEmpty(seq) || isEmpty(searchChars)) {
             return INDEX_NOT_FOUND;
         }
-        final int[] codePoints = searchChars.codePoints().sorted().toArray();
+        final Set<Integer> searchSetCodePoints = searchChars.codePoints()
+                .boxed().collect(Collectors.toSet());
         // advance character index from one interpreted codepoint to the next
         for (int curSeqCharIdx = 0; curSeqCharIdx < seq.length();) {
             final int curSeqCodePoint = Character.codePointAt(seq, 
curSeqCharIdx);
-            if (Arrays.binarySearch(codePoints, curSeqCodePoint) < 0) {
+            if (!searchSetCodePoints.contains(curSeqCodePoint)) {
                 return curSeqCharIdx;
             }
             curSeqCharIdx += Character.charCount(curSeqCodePoint); // skip 
indices to paired low-surrogates
diff --git 
a/src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java 
b/src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java
index 2a9e37c28..d84f81668 100644
--- a/src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java
+++ b/src/test/java/org/apache/commons/lang3/StringUtilsEqualsIndexOfTest.java
@@ -24,12 +24,17 @@ import static org.hamcrest.MatcherAssert.assertThat;
 import static org.junit.jupiter.api.Assertions.assertEquals;
 import static org.junit.jupiter.api.Assertions.assertFalse;
 import static org.junit.jupiter.api.Assertions.assertTrue;
+import static org.junit.jupiter.params.provider.Arguments.arguments;
 
 import java.nio.CharBuffer;
 import java.util.Locale;
+import java.util.stream.Stream;
 
 import org.hamcrest.core.IsNot;
 import org.junit.jupiter.api.Test;
+import org.junit.jupiter.params.ParameterizedTest;
+import org.junit.jupiter.params.provider.Arguments;
+import org.junit.jupiter.params.provider.MethodSource;
 
 /**
  * Tests {@link StringUtils} - Equals/IndexOf methods
@@ -90,6 +95,30 @@ public class StringUtilsEqualsIndexOfTest extends 
AbstractLangTest {
 
     private static final String[] FOOBAR_SUB_ARRAY = {"ob", "ba"};
 
+    static Stream<Arguments> indexOfAnyBut_withSurrogateChars() {
+        // @formatter:off
+        return Stream.of(
+            arguments(CharU20000 + CharU20001,    CharU20000, 2),
+            arguments(CharU20000 + CharU20001,    CharU20001, 0),
+            arguments(CharU20000 + CharU20001,    "abcd" + CharUSuppCharLow, 
0),
+            arguments(CharU20000 + CharU20001,    "abcd" + CharUSuppCharHigh, 
0),
+            arguments(CharU20000,                 CharU20000, -1),
+            arguments(CharU20000,                 CharU20001, 0),
+            arguments(CharUSuppCharHigh + "aaaa", CharUSuppCharHigh + "abcd", 
-1),
+            arguments(CharUSuppCharHigh + "baaa", CharUSuppCharHigh + "abcd", 
-1),
+            arguments(CharUSuppCharHigh + "aaaa", CharU20000 + "abcd", 0),
+            arguments("aaaa" + CharUSuppCharHigh, CharU20000 + "abcd", 4),
+            arguments(CharUSuppCharLow + "aaaa",  CharU20000 + "abcd", 0),
+            arguments("aaaa" + CharUSuppCharLow,  CharU20000 + "abcd", 4),
+            arguments(CharU20000 + "aaaa",        CharUSuppCharLow + "ab" + 
CharUSuppCharHigh + "cd", 0),
+            arguments(CharU20000 + "aaaa",        "abcd", 0),
+            arguments(CharU20000 + "aaaa",        "abcd" + CharUSuppCharHigh, 
0),
+            arguments(CharU20000 + "aaaa",        "abcd" + CharUSuppCharLow, 
0),
+            arguments("aaaa" + CharU20000,        CharU20000 + "abcd", -1)
+        );
+        // @formatter:on
+    }
+
     @Test
     public void testCompare_StringString() {
         assertEquals(0, StringUtils.compare(null, null));
@@ -445,25 +474,10 @@ public class StringUtilsEqualsIndexOfTest extends 
AbstractLangTest {
         assertEquals(0, StringUtils.indexOfAnyBut("aba", 'z'));
     }
 
-    @Test
-    public void testIndexOfAnyBut_StringCharArrayWithSurrogateChars() {
-        assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
CharU20000.toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
CharU20001.toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
("abcd" + CharUSuppCharLow).toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
("abcd" + CharUSuppCharHigh).toCharArray()));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, 
CharU20000.toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, 
CharU20001.toCharArray()));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", 
(CharUSuppCharHigh + "abcd").toCharArray()));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "baaa", 
(CharUSuppCharHigh + "abcd").toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", 
(CharU20000 + "abcd").toCharArray()));
-        assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharHigh, 
(CharU20000 + "abcd").toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharLow + "aaaa", 
(CharU20000 + "abcd").toCharArray()));
-        assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharLow, 
(CharU20000 + "abcd").toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", 
(CharUSuppCharLow + "ab" + CharUSuppCharHigh + "cd").toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", 
"abcd".toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" 
+ CharUSuppCharHigh).toCharArray()));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", ("abcd" 
+ CharUSuppCharLow).toCharArray()));
-        assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, 
(CharU20000 + "abcd").toCharArray()));
+    @ParameterizedTest
+    @MethodSource("indexOfAnyBut_withSurrogateChars")
+    public void testIndexOfAnyBut_StringCharArrayWithSurrogateChars(final 
CharSequence seq, final String searchChars, final int expected) {
+        assertEquals(expected, StringUtils.indexOfAnyBut(seq, 
searchChars.toCharArray()));
     }
 
     @Test
@@ -483,25 +497,10 @@ public class StringUtilsEqualsIndexOfTest extends 
AbstractLangTest {
         assertEquals(0, StringUtils.indexOfAnyBut("ab", "z"));
     }
 
-    @Test
-    public void testIndexOfAnyBut_StringStringWithSurrogateChars() {
-        assertEquals(2, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
CharU20000));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
CharU20001));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
"abcd" + CharUSuppCharLow));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + CharU20001, 
"abcd" + CharUSuppCharHigh));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharU20000, CharU20000));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000, CharU20001));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", 
CharUSuppCharHigh + "abcd"));
-        assertEquals(-1, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "baaa", 
CharUSuppCharHigh + "abcd"));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharHigh + "aaaa", 
CharU20000 + "abcd"));
-        assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharHigh, 
CharU20000 + "abcd"));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharUSuppCharLow + "aaaa", 
CharU20000 + "abcd"));
-        assertEquals(4, StringUtils.indexOfAnyBut("aaaa" + CharUSuppCharLow, 
CharU20000 + "abcd"));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", 
CharUSuppCharLow + "ab" + CharUSuppCharHigh + "cd"));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", 
"abcd"));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" 
+ CharUSuppCharHigh));
-        assertEquals(0, StringUtils.indexOfAnyBut(CharU20000 + "aaaa", "abcd" 
+ CharUSuppCharLow));
-        assertEquals(-1, StringUtils.indexOfAnyBut("aaaa" + CharU20000, 
CharU20000 + "abcd"));
+    @ParameterizedTest
+    @MethodSource("indexOfAnyBut_withSurrogateChars")
+    public void testIndexOfAnyBut_StringStringWithSurrogateChars(final 
CharSequence seq, final CharSequence searchChars, final int expected) {
+        assertEquals(expected, StringUtils.indexOfAnyBut(seq, searchChars));
     }
 
     @Test

Reply via email to