aaron.ballman added inline comments.
================ Comment at: clang/lib/Lex/Lexer.cpp:2702-2703 // (probably ending) '/' character. - if (CurPtr + 24 < BufferEnd && + // When diagnosing invalid UTF-8 sequences we always skip the fast + // vectorized path. + if (!WarnOnInvalidUtf8 && CurPtr + 24 < BufferEnd && ---------------- However, *this* seems like it could noticeably impact performance, especially for C code (including C header files in C++ mode), and may be worth getting measurements for. ================ Comment at: clang/lib/Lex/Lexer.cpp:2399 + getSourceLocation(CurPtr)); + bool UnicodeDecodeFailed = false; + ---------------- cor3ntin wrote: > aaron.ballman wrote: > > It looks like this can move into the `while` loop and we can remove some `= > > false` from within the loop? > The idea here is to not diagnose a contiguous sequence of invalid code units > more than once > The sequence `0x80x0x80x80` for example will only warn one for example. The > goal is to avoid noise when, for example there is a long comment in > Windows-1251 encoded cyrillic. it would warn one per word (because the spaces > would decode fine) rather than one per contiguous non-sense character. > It's also somewhat necessary because we can't know where the end of the > invalid "character" is, better to be greedy. > (The test file check that behavior) Oh derp, you're right (of course), sorry for the noise! ================ Comment at: clang/lib/Lex/Lexer.cpp:2407 + // diagnostic only once per sequence that cannot be decoded. + while ((!WarnOnInvalidUtf8 || isASCII(C)) && C != 0 && // Potentially EOF. + C != '\n' && C != '\r') { // Newline or DOS-style newline. ---------------- cor3ntin wrote: > aaron.ballman wrote: > > Do you have some idea of performance cost for enabling the warning with a > > large TU? Are we talking .5%? 1%? 5%? noise? > This adds 2 comparisons when the warning is enbled (is ASCII(c) is simply `C > < 0x80`) - so it should be noise, but I have not run benchmark. > It might be more noticeable within multi line comments as there the > optimization that vectorizes the comment skipping on SSE2 platforms is simply > skipped when this optimization is enabled. Ah, no need to run the benchmark then, I was thinking this would involve function call overhead and more complex checking for the `isASCII` call than exists in reality. ================ Comment at: clang/lib/Lex/Lexer.cpp:2417 + if (Length == 0) { + if (!UnicodeDecodeFailed) { + Diag(CurPtr, diag::warn_invalid_utf8_in_comment); ---------------- cor3ntin wrote: > aaron.ballman wrote: > > aaron.ballman wrote: > > > > > So if unicode decoding didn't fail... diagnose? :-) The naming here seems > > unfortunate. > `if(!UnicodeDecodeFailedInThePreviousLoopIteration)` ? I'm open to suggestion > here `UnicodeDecodingAlreadyDiagnosed` Repository: rG LLVM Github Monorepo CHANGES SINCE LAST ACTION https://reviews.llvm.org/D128059/new/ https://reviews.llvm.org/D128059 _______________________________________________ cfe-commits mailing list cfe-commits@lists.llvm.org https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits