kevcadieux created this revision.
Herald added a project: All.
kevcadieux requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

This change fixes a clang-format unit test failure introduced by D124748 
<https://reviews.llvm.org/D124748>. The `countLeadingWhitespace` function was 
calling `isspace` with values that could fall outside the valid input range. 
`isspace` requires its argument to be representable as `unsigned char` 
(0-255) or equal to `EOF`. Any other value is undefined behavior, which on 
Windows manifests as an assertion failure in the debug runtime libraries. 
`countLeadingWhitespace` was passing a plain `char`, which is negative where 
`char` is signed whenever the underlying byte is 128 or above, as happens in 
non-ASCII text. The fix is to iterate with `StringRef`'s `bytes_begin` and 
`bytes_end`, which read the bytes as `unsigned char` instead.
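
For illustration only, here is a minimal standalone sketch of the same idea 
outside of LLVM. The helper name `countLeadingSpaceBytes` is hypothetical and 
is not part of clang-format; it uses `std::string` rather than `StringRef` and 
converts each byte to `unsigned char` before calling `std::isspace`, which 
should have the same effect as iterating over `bytes_begin`/`bytes_end`.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper, not part of clang-format: counts the leading
// whitespace bytes of Text without ever passing a negative value to isspace.
std::size_t countLeadingSpaceBytes(const std::string &Text) {
  std::size_t Count = 0;
  for (char C : Text) {
    // Where char is signed, a byte such as 0xC3 is a negative char. Converting
    // to unsigned char first keeps the argument inside the 0-255 range.
    if (!std::isspace(static_cast<unsigned char>(C)))
      break;
    ++Count;
  }
  return Count;
}
```

With this conversion in place, input that begins with a byte of 128 or above 
is treated as ordinary non-whitespace in the default "C" locale rather than 
triggering undefined behavior.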

This bug can be reproduced by building the `check-clang-unit` target with a 
DEBUG configuration under Windows. This change is already covered by existing 
unit tests.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D128786

Files:
  clang/lib/Format/FormatTokenLexer.cpp


Index: clang/lib/Format/FormatTokenLexer.cpp
===================================================================
--- clang/lib/Format/FormatTokenLexer.cpp
+++ clang/lib/Format/FormatTokenLexer.cpp
@@ -864,8 +864,10 @@
   // Directly using the regex turned out to be slow. With the regex
   // version formatting all files in this directory took about 1.25
   // seconds. This version took about 0.5 seconds.
-  const char *Cur = Text.begin();
-  while (Cur < Text.end()) {
+  const unsigned char *const Begin = Text.bytes_begin();
+  const unsigned char *const End = Text.bytes_end();
+  const unsigned char *Cur = Begin;
+  while (Cur < End) {
     if (isspace(Cur[0])) {
       ++Cur;
     } else if (Cur[0] == '\\' && (Cur[1] == '\n' || Cur[1] == '\r')) {
@@ -874,20 +876,20 @@
       // The source has a null byte at the end. So the end of the entire input
       // isn't reached yet. Also the lexer doesn't break apart an escaped
       // newline.
-      assert(Text.end() - Cur >= 2);
+      assert(End - Cur >= 2);
       Cur += 2;
     } else if (Cur[0] == '?' && Cur[1] == '?' && Cur[2] == '/' &&
                (Cur[3] == '\n' || Cur[3] == '\r')) {
       // Newlines can also be escaped by a '?' '?' '/' trigraph. By the way, the
       // characters are quoted individually in this comment because if we write
       // them together some compilers warn that we have a trigraph in the code.
-      assert(Text.end() - Cur >= 4);
+      assert(End - Cur >= 4);
       Cur += 4;
     } else {
       break;
     }
   }
-  return Cur - Text.begin();
+  return Cur - Begin;
 }
 
 FormatToken *FormatTokenLexer::getNextToken() {


_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
