kevcadieux created this revision.
Herald added a project: All.
kevcadieux requested review of this revision.
Herald added a project: clang.
Herald added a subscriber: cfe-commits.

This change fixes a clang-format unit test failure introduced by D124748 <https://reviews.llvm.org/D124748>.

The `countLeadingWhitespace` function was calling `isspace` with values that could fall outside the valid input range. `isspace` accepts only values representable as `unsigned char` (0-255) or `EOF`. Passing any other value is undefined behavior, which on Windows manifests as an assertion raised in the debug runtime libraries. `countLeadingWhitespace` was calling `isspace` with a signed char, which is negative whenever the underlying byte's value is 128 or above. Such bytes occur in non-ASCII encodings.

The fix is to use `StringRef`'s `bytes_begin` and `bytes_end` iterators to read the values as unsigned chars instead.

This bug can be reproduced by building the `check-clang-unit` target with a DEBUG configuration under Windows.

This change is already covered by existing unit tests.


Repository:
  rG LLVM Github Monorepo

https://reviews.llvm.org/D128786

Files:
  clang/lib/Format/FormatTokenLexer.cpp

Index: clang/lib/Format/FormatTokenLexer.cpp
===================================================================
--- clang/lib/Format/FormatTokenLexer.cpp
+++ clang/lib/Format/FormatTokenLexer.cpp
@@ -864,8 +864,10 @@
   // Directly using the regex turned out to be slow. With the regex
   // version formatting all files in this directory took about 1.25
   // seconds. This version took about 0.5 seconds.
-  const char *Cur = Text.begin();
-  while (Cur < Text.end()) {
+  const unsigned char *const Begin = Text.bytes_begin();
+  const unsigned char *const End = Text.bytes_end();
+  const unsigned char *Cur = Begin;
+  while (Cur < End) {
     if (isspace(Cur[0])) {
       ++Cur;
     } else if (Cur[0] == '\\' && (Cur[1] == '\n' || Cur[1] == '\r')) {
@@ -874,20 +876,20 @@
       // The source has a null byte at the end. So the end of the entire input
       // isn't reached yet. Also the lexer doesn't break apart an escaped
       // newline.
-      assert(Text.end() - Cur >= 2);
+      assert(End - Cur >= 2);
       Cur += 2;
     } else if (Cur[0] == '?' && Cur[1] == '?' && Cur[2] == '/' &&
                (Cur[3] == '\n' || Cur[3] == '\r')) {
       // Newlines can also be escaped by a '?' '?' '/' trigraph. By the way, the
       // characters are quoted individually in this comment because if we write
       // them together some compilers warn that we have a trigraph in the code.
-      assert(Text.end() - Cur >= 4);
+      assert(End - Cur >= 4);
       Cur += 4;
     } else {
       break;
     }
   }
-  return Cur - Text.begin();
+  return Cur - Begin;
 }
 
 FormatToken *FormatTokenLexer::getNextToken() {

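For readers unfamiliar with the `isspace` pitfall described in the summary, here is a minimal standalone sketch of the same idea outside of LLVM. It is an illustration only, not the code in this patch: `countLeadingSpaceBytes` is a hypothetical helper, it uses `std::string` rather than `StringRef`, and it leaves out the escaped-newline and trigraph handling that `countLeadingWhitespace` performs. It assumes the default "C" locale.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not part of clang-format.
// Counts leading whitespace bytes, converting each byte to unsigned char
// before it is passed to std::isspace.
std::size_t countLeadingSpaceBytes(const std::string &Text) {
  std::size_t Count = 0;
  for (char C : Text) {
    // Where plain char is signed, a byte >= 0x80 is negative as a char.
    // The cast maps it back into 0-255, the range std::isspace accepts.
    if (!std::isspace(static_cast<unsigned char>(C)))
      break;
    ++Count;
  }
  return Count;
}
```

Calling the helper on `" \xC3\xA9"` (a space followed by the two UTF-8 bytes of U+00E9) stops at the first non-ASCII byte without passing a negative value to `std::isspace`. The patch reaches the same result by iterating over `bytes_begin()`/`bytes_end()`, so no per-call cast is needed.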