[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Aaron Ballman via Phabricator via cfe-commits Wed, 22 Jun 2022 04:36:48 -0700

aaron.ballman added inline comments.


================
Comment at: clang/lib/Lex/Lexer.cpp:2702-2703
     // (probably ending) '/' character.
-    if (CurPtr + 24 < BufferEnd &&
+    // When diagnosing invalid UTF-8 sequences we always skip the fast
+    // vectorized path.
+    if (!WarnOnInvalidUtf8 && CurPtr + 24 < BufferEnd &&
----------------
However, *this* seems like it could noticeably impact performance, especially 
for C code (including C header files in C++ mode), and may be worth getting 
measurements for.


================
Comment at: clang/lib/Lex/Lexer.cpp:2399
+                                      getSourceLocation(CurPtr));
+  bool UnicodeDecodeFailed = false;
+
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > It looks like this can move into the `while` loop and we can remove some `= 
> > false` from within the loop?
> The idea here is to not diagnose a contiguous sequence of invalid code units 
> more than once
> The sequence  `0x80x0x80x80` for example will only warn one for example. The 
> goal is to avoid noise when, for example there is a long comment in 
> Windows-1251 encoded cyrillic. it would warn one per word (because the spaces 
> would decode fine) rather than one per contiguous non-sense character.
> It's also somewhat necessary because we can't know where the end of the 
> invalid "character" is, better to be greedy. 
> (The test file check that behavior)
Oh derp, you're right (of course), sorry for the noise!


================
Comment at: clang/lib/Lex/Lexer.cpp:2407
+    // diagnostic only once per sequence that cannot be decoded.
+    while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF.
+           C != '\n' && C != '\r') { // Newline or DOS-style newline.
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > Do you have some idea of performance cost for enabling the warning with a 
> > large TU? Are we talking .5%? 1%? 5%? noise?
> This adds 2 comparisons when the warning is enbled (is ASCII(c) is simply `C 
> < 0x80`) - so it should be noise, but I have not run benchmark.
> It might be more noticeable within multi line comments as there the 
> optimization that vectorizes the comment skipping on SSE2 platforms is simply 
> skipped when this optimization is enabled.
Ah, no need to run the benchmark then, I was thinking this would involve 
function call overhead and more complex checking for the `isASCII` call than 
exists in reality.


================
Comment at: clang/lib/Lex/Lexer.cpp:2417
+      if (Length == 0) {
+        if (!UnicodeDecodeFailed) {
+          Diag(CurPtr, diag::warn_invalid_utf8_in_comment);
----------------
cor3ntin wrote:
> aaron.ballman wrote:
> > aaron.ballman wrote:
> > > 
> > So if unicode decoding didn't fail... diagnose? :-) The naming here seems 
> > unfortunate.
> `if(!UnicodeDecodeFailedInThePreviousLoopIteration)` ? I'm open to suggestion 
> here
`UnicodeDecodingAlreadyDiagnosed`


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D128059/new/

https://reviews.llvm.org/D128059

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D128059: [Clang] Add a warning on invalid UTF-8 in comments.

Reply via email to