[clang] Treat escaped newlines as whitespace in Lexer::getRawToken. (PR #117548)

Samira Bazuzi via cfe-commits Mon, 25 Nov 2024 04:54:57 -0800

https://github.com/bazuzi created 
https://github.com/llvm/llvm-project/pull/117548


The Lexer used in getRawToken is not told to keep whitespace, so when it skips 
over escaped newlines, it also ignores whitespace, regardless of getRawToken's 
IgnoreWhiteSpace parameter. My suspicion is that users that want to not 
IgnoreWhiteSpace and therefore return true for a whitespace character would 
also safely accept true for an escaped newline. For users that do use 
IgnoreWhiteSpace, there is no behavior change, and the handling of escaped 
newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this 
change, getRawToken should be modified to return true when whitespace follows 
the escaped newline present at `Loc`, perhaps by using 
isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible 
with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. 
getPreviousTokenAndStart loops backwards through source location offsets, 
always decrementing by 1 without regard for potential character sizes larger 
than 1, such as escaped newlines. It seems more likely to me that there are 
more functions like this that would break than there are users who rely on 
escaped newlines not being treated as whitespace by getRawToken, but I'm open 
to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and 
now does not find a macro name. In my opinion, this is not an indication that 
the new behavior for getRawToken is incorrect. Rather, this is, both before and 
after this change, due to an incorrect storage of the backslash's source 
location as the spelling location of the expansion location of `F`.

>From 9c8b31dc266b770927785834c841b8ae5a7ebb58 Mon Sep 17 00:00:00 2001
From: Samira Bazuzi <baz...@google.com>
Date: Fri, 22 Nov 2024 15:45:55 -0500
Subject: [PATCH] Treat escaped newlines as whitespace in Lexer::getRawToken.

The Lexer used in getRawToken is not told to keep whitespace, so when it skips 
over escaped newlines, it also ignores whitespace, regardless of getRawToken's 
IgnoreWhiteSpace parameter. My suspicion is that users that want to not 
IgnoreWhiteSpace and therefore return true for a whitespace character would 
also safely accept true for an escaped newline. For users that do use 
IgnoreWhiteSpace, there is no behavior change, and the handling of escaped 
newlines is already correct.

If an escaped newline should not be considered whitespace, then instead of this 
change, getRawToken should be modified to return true when whitespace follows 
the escaped newline present at `Loc`, perhaps by using 
isWhitespace(SkipEscapedNewLines(StrData)[0]). However, this is incompatible 
with functions like clang::tidy::utils::lexer::getPreviousTokenAndStart. 
getPreviousTokenAndStart loops backwards through source location offsets, 
always decrementing by 1 without regard for potential character sizes larger 
than 1, such as escaped newlines. It seems more likely to me that there are 
more functions like this that would break than there are users who rely on 
escaped newlines not being treated as whitespace by getRawToken, but I'm open 
to that not being true.

The modified test was printing `\\nF` for the name of the expanded macro and 
now does not find a macro name. In my opinion, this is not an indication that 
the new behavior for getRawToken is incorrect. Rather, this is, both before and 
after this change, due to an incorrect storage of the backslash's source 
location as the spelling location of the expansion location of `F`.
---
 clang/lib/Lex/Lexer.cpp              | 4 +++-
 clang/test/Frontend/highlight-text.c | 3 +--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/clang/lib/Lex/Lexer.cpp b/clang/lib/Lex/Lexer.cpp
index e58c8bc72ae5b3..392cce6be0d171 100644
--- a/clang/lib/Lex/Lexer.cpp
+++ b/clang/lib/Lex/Lexer.cpp
@@ -527,7 +527,9 @@ bool Lexer::getRawToken(SourceLocation Loc, Token &Result,
 
   const char *StrData = Buffer.data()+LocInfo.second;
 
-  if (!IgnoreWhiteSpace && isWhitespace(StrData[0]))
+  if (!IgnoreWhiteSpace && (isWhitespace(StrData[0]) ||
+                            // Treat escaped newlines as whitespace.
+                            SkipEscapedNewLines(StrData) != StrData))
     return true;
 
   // Create a lexer starting at the beginning of this token.
diff --git a/clang/test/Frontend/highlight-text.c 
b/clang/test/Frontend/highlight-text.c
index a81d26caa4c24c..eefa4ebeec8ca4 100644
--- a/clang/test/Frontend/highlight-text.c
+++ b/clang/test/Frontend/highlight-text.c
@@ -12,8 +12,7 @@ int a = M;
 // CHECK-NEXT: :5:11: note: expanded from macro 'M'
 // CHECK-NEXT:     5 | #define M \
 // CHECK-NEXT:       |           ^
-// CHECK-NEXT: :3:14: note: expanded from macro '\
-// CHECK-NEXT: F'
+// CHECK-NEXT: :3:14: note: expanded from here
 // CHECK-NEXT:     3 | #define F (1 << 99)
 // CHECK-NEXT:       |              ^  ~~
 // CHECK-NEXT: :8:9: warning: shift count >= width of type 
[-Wshift-count-overflow]

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[clang] Treat escaped newlines as whitespace in Lexer::getRawToken. (PR #117548)

Reply via email to