Copilot commented on code in PR #4079:
URL: https://github.com/apache/logging-log4j2/pull/4079#discussion_r2983763152
##########
log4j-api-test/src/test/java/org/apache/logging/log4j/util/StringBuildersTest.java:
##########
@@ -79,15 +84,37 @@ void escapeJsonCharactersISOControl() {
assertEquals(jsonValueEscaped, sb.toString());
}
- @Test
- void escapeXMLCharactersCorrectly() {
- final String xmlValueNotEscaped = "<\"Salt&Peppa'\">";
- final String xmlValueEscaped =
"<"Salt&Peppa'">";
+ static Stream<Arguments> escapeXmlCharactersCorrectly() {
+ final char replacement = '\uFFFD';
+ return Stream.of(
+ // Empty
+ Arguments.of("", ""),
+ // characters that need to be escaped
+ Arguments.of("<\"Salt&Peppa'\">",
"<"Salt&Peppa'">"),
+ // control character replaced with U+FFFD
+ Arguments.of("A" + (char) 0x01 + "B", "A" + replacement + "B"),
+ // standalone low surrogate replaced with U+FFFD
+ Arguments.of("low" + Character.MIN_SURROGATE + "surrogate",
"low" + replacement + "surrogate"),
+ Arguments.of(Character.MIN_SURROGATE + "low", replacement +
"low"),
+ // standalone high surrogate replaced with U+FFFD
+ Arguments.of("high" + Character.MAX_SURROGATE + "surrogate",
"high" + replacement + "surrogate"),
+ Arguments.of(Character.MAX_SURROGATE + "high", replacement +
"high"),
Review Comment:
The surrogate tests are using `Character.MIN_SURROGATE`/`MAX_SURROGATE`, but
the comments label them as “low surrogate” and “high surrogate” respectively.
`MIN_SURROGATE` is the minimum *high* surrogate (0xD800) and `MAX_SURROGATE` is
the maximum *low* surrogate (0xDFFF), so these cases don’t actually exercise
the intended branches (especially the invalid low-surrogate handling). Consider
switching to `Character.MIN_LOW_SURROGATE`/`MAX_LOW_SURROGATE` and
`Character.MIN_HIGH_SURROGATE`/`MAX_HIGH_SURROGATE` (and/or adjust the
comments) so the tests match what they claim to cover.
```suggestion
Arguments.of("low" + Character.MIN_LOW_SURROGATE +
"surrogate", "low" + replacement + "surrogate"),
Arguments.of(Character.MIN_LOW_SURROGATE + "low",
replacement + "low"),
// standalone high surrogate replaced with U+FFFD
Arguments.of("high" + Character.MAX_HIGH_SURROGATE +
"surrogate", "high" + replacement + "surrogate"),
Arguments.of(Character.MAX_HIGH_SURROGATE + "high",
replacement + "high"),
```
##########
log4j-api/src/main/java/org/apache/logging/log4j/util/StringBuilders.java:
##########
@@ -369,8 +392,32 @@ public static void escapeXml(final StringBuilder
toAppendTo, final int start) {
toAppendTo.setCharAt(lastPos--, '&');
break;
default:
- toAppendTo.setCharAt(lastPos--, c);
+ toAppendTo.setCharAt(lastPos--, isValidXml10(c) ? c :
REPLACEMENT_CHAR);
}
}
}
+
+ /**
+ * Checks if a code point is a valid XML 1.0 character
+ *
+ * <p>This method is restricted to characters in the BMP, i.e. represented
by one UTF-16 code unit.</p>
+ *
+ * @param codePoint a code point
+ * @return {@code true} if it is a valid XML 1.0 code point
+ */
+ private static boolean isValidXml10(final char codePoint) {
+ // XML 1.0 valid characters (Fifth Edition):
+ // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
+
+ // [#x20–#xD7FF] (placed early as a fast path for the most common case)
+ return (codePoint >= ' ' && codePoint < Character.MIN_SURROGATE)
+ // #x9
+ || codePoint == '\t'
+ // #xA
+ || codePoint == '\n'
+ // #xD
+ || codePoint == '\r'
+ // [#xE000-#xFFFD]
+ || (codePoint > Character.MAX_SURROGATE && codePoint <=
0xFFFD);
Review Comment:
`isValidXml10` is documented as operating on a “code point”, but the
signature takes a `char` (BMP-only). To avoid confusion, consider renaming the
parameter to something like `ch`/`c` and updating the Javadoc to explicitly say
“BMP char” rather than “code point”.
```suggestion
* Checks if a BMP {@code char} is a valid XML 1.0 character.
*
* <p>This method is restricted to characters in the BMP, i.e.
represented by one UTF-16 code unit.</p>
*
* @param ch a BMP {@code char} to validate
* @return {@code true} if it is a valid XML 1.0 character
*/
private static boolean isValidXml10(final char ch) {
// XML 1.0 valid characters (Fifth Edition):
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] |
[#x10000-#x10FFFF]
// [#x20–#xD7FF] (placed early as a fast path for the most common
case)
return (ch >= ' ' && ch < Character.MIN_SURROGATE)
// #x9
|| ch == '\t'
// #xA
|| ch == '\n'
// #xD
|| ch == '\r'
// [#xE000-#xFFFD]
|| (ch > Character.MAX_SURROGATE && ch <= 0xFFFD);
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]