[PATCH] D155610: [Clang][Sema] Fix display of characters on static assertion failure

Tom Honermann via Phabricator via cfe-commits Tue, 12 Sep 2023 12:55:16 -0700

tahonermann added inline comments.


================
Comment at: clang/test/SemaCXX/static-assert-cxx26.cpp:304
+static_assert('\u{9}' == (char)1, ""); // expected-error {{failed}} \
+                                       // expected-note {{evaluates to ''\t' (0x09, 9) == '<U+0001>' (0x01, 1)'}}
+static_assert((char8_t)-128 == (char8_t)-123, ""); // expected-error {{failed}} \
----------------
cor3ntin wrote:
> tahonermann wrote:
> > cor3ntin wrote:
> > > tahonermann wrote:
> > > > Is the expected note up to date? I don't see code that would generate 
> > > > the `<U+0001>` output. Am I just missing it? Since U+0001 is a valid, 
> > > > though non-printable, character, I would expect something more like 
> > > > `'\u0001'`.
> > > See elsewhere in the discussion. This formatting is pre-existing and 
> > > managed at the DiagnosticEngine level (pushEscapedString). The reason 
> > > it's not `\u0001` is (1) to avoid reusing C++ syntactic elements for 
> > > something that comes from diagnostics and is not represented as an 
> > > escape sequence in source, and (2) `\u00011` is unreadable, and 
> > > `\U000000001` is also not helpful :)
> > > 
> > Thanks for the explanation. I'm not sure that I agree with the rationale 
> > for (1) though. We're already putting the value in single quotes and 
> > representing some values with escapes in many of these cases when the value 
> > isn't produced by an escape sequence (or even a character/string literal); 
> > why exclude `\uXXXX`? I agree with the rationale for (2); we could use 
> > `'\u{1}'` in that case.
> FYI, afaik the notation in clang predates the existence of \u{} by a few 
> years and follows Unicode notation 
> (https://unicode.org/mail-arch/unicode-ml/y2005-m11/0060.html).
> The oldest instance seems to be 
> https://github.com/llvm/llvm-project/commit/77091b167fd959e1ee0c4dad4ec44de43b6c95db
>  - I followed suit when reworking the generic escaping mechanism that all 
> strings fed to diagnostics go through.
> 
> I don't care about changing the syntax, but I do hope we are consistent. 
> Ultimately what we are trying to do is to designate a Unicode code point, and 
> whether we do it through C++ syntax or not probably does not matter much as 
> long as it's clear, delimited and consistent!
I think the substitution of `<U+XXXX>` by the diagnostic engine itself is 
perfectly fine and good; particularly when it has no context to suggest a 
different presentation. In this particular case, where the character is being 
presented using C++ syntax as a character literal, I would prefer that C++ 
syntax be used consistently.

From an implementation standpoint, I'm suggesting that 
`WriteCharValueForDiagnostic()` be modified such that, if 
`escapeCStyle<EscapeChar::Single>()` returns an empty string, the character 
is presented in `'\u{XXXX}'` form when it is one that would otherwise be 
substituted by the diagnostic engine (e.g., if `isPrintable()` is false). Note 
that this would be restricted to `char` values <= 0x7F; larger values could 
still be passed through as invalid code units that the diagnostic engine would 
then render as, e.g., `'<FC>'`.
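
To make the suggestion concrete, here is a minimal standalone sketch of the 
selection logic. It does not use the actual Clang code: 
`escapeSingleQuoted()` and `isPrintableASCII()` are hypothetical stand-ins for 
`escapeCStyle<EscapeChar::Single>()` and `isPrintable()`, and 
`formatCharForDiagnostic()` is an illustrative name, not the real signature of 
`WriteCharValueForDiagnostic()`.

```cpp
#include <cstdio>
#include <string>

// Hypothetical stand-in for escapeCStyle<EscapeChar::Single>(): returns the
// C-style escape for a character, or an empty string if none applies.
static std::string escapeSingleQuoted(unsigned char C) {
  switch (C) {
  case '\\': return "\\\\";
  case '\'': return "\\'";
  case '\a': return "\\a";
  case '\b': return "\\b";
  case '\f': return "\\f";
  case '\n': return "\\n";
  case '\r': return "\\r";
  case '\t': return "\\t";
  case '\v': return "\\v";
  default:   return "";
  }
}

// Hypothetical stand-in for isPrintable(), restricted to ASCII.
static bool isPrintableASCII(unsigned char C) {
  return C >= 0x20 && C <= 0x7E;
}

// Sketch of the proposed presentation of a `char` value as a character
// literal in a diagnostic note.
std::string formatCharForDiagnostic(unsigned char C) {
  std::string Out = "'";
  std::string Esc = escapeSingleQuoted(C);
  if (!Esc.empty()) {
    // A simple escape such as \t or \n exists; use it.
    Out += Esc;
  } else if (C <= 0x7F && !isPrintableASCII(C)) {
    // Non-printable ASCII: use the delimited form, e.g. \u{1}.
    char Buf[16];
    std::snprintf(Buf, sizeof(Buf), "\\u{%X}", static_cast<unsigned>(C));
    Out += Buf;
  } else {
    // Printable ASCII, or a byte above 0x7F that is passed through so the
    // diagnostic engine can render it (e.g., as <FC>).
    Out += static_cast<char>(C);
  }
  Out += "'";
  return Out;
}
```

Under these assumptions, a `char` value of 1 is shown as `'\u{1}'`, a tab is 
shown as `'\t'`, and 0xFC is passed through unchanged for the diagnostic 
engine to handle.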


================
Comment at: clang/test/SemaCXX/static-assert.cpp:287
+  static_assert((char16_t)L'ゆ' == L"C̵̭̯̠̎͌ͅť̺"[1], ""); // expected-error {{failed}} \
+                                                  // expected-note {{evaluates to ''ゆ' (0x3086) == '̵' (0x335)'}}
+  static_assert(L"＼／"[1] == u'\xFFFD', ""); // expected-error {{failed}} \
----------------
cor3ntin wrote:
> hubert.reinterpretcast wrote:
> > hubert.reinterpretcast wrote:
> > > cor3ntin wrote:
> > > > hubert.reinterpretcast wrote:
> > > > > The C++23 escaped string formatting facility would not generate a 
> > > > > trailing combining character like this. I recommend following suit.
> > > > > 
> > > > > Info on U+0335: 
> > > > > https://util.unicode.org/UnicodeJsps/character.jsp?a=0335
> > > > > 
> > > > This is way outside the scope of the patch. The diagnostic output 
> > > > facility has no understanding of combining characters or graphemes and 
> > > > does not attempt to match std::print. It probably would be an 
> > > > improvement, but this patch is not trying to modify how all diagnostics 
> > > > are printed (all of that logic is in Diagnostic.cpp).
> > > This patch is pushing the envelope of what appears in diagnostics. One 
> > > can also argue that someone writing
> > > ```
> > > static_assert(false, "\u0301");
> > > ```
> > > gets what they deserve, but that case does not have a big problem anyway 
> > > (because the provided message text appears after `: `).
> > > 
> > > This patch increases the exposure of the diagnostic output facility to 
> > > input that it does not handle well. I disagree that it is outside the 
> > > scope of this patch to insist that it does not generate such inputs to 
> > > the diagnostic output facility (even if a possible solution is to modify 
> > > the diagnostic output facility first).
> > @cor3ntin, do you have status quo examples for how grapheme-extending 
> > characters that are not already "problematic" in their original context are 
> > emitted in diagnostics in contexts where they are?
> Are you looking for that sort of example? https://godbolt.org/z/c79xWr7Me
> That shows that clang has no understanding of graphemes.
gcc and MSVC get that case "right" (probably by accident). 
https://godbolt.org/z/Tjd6xnEon
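
For illustration only, one way a caller could avoid handing a bare combining 
character to the diagnostic engine is to escape it before quoting it. The 
sketch below is standalone and hypothetical: `isCombiningDiacritic()` checks 
only the Combining Diacritical Marks block (U+0300 to U+036F) rather than the 
full Grapheme_Extend property, and none of these names exist in Clang.

```cpp
#include <cstdio>
#include <string>

// Assumption: a crude approximation of "grapheme-extending" that covers only
// the Combining Diacritical Marks block.
static bool isCombiningDiacritic(char32_t CP) {
  return CP >= 0x0300 && CP <= 0x036F;
}

// Append the UTF-8 encoding of a code point (assumed to be a valid scalar).
static void appendUTF8(std::string &Out, char32_t CP) {
  if (CP < 0x80) {
    Out += static_cast<char>(CP);
  } else if (CP < 0x800) {
    Out += static_cast<char>(0xC0 | (CP >> 6));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else if (CP < 0x10000) {
    Out += static_cast<char>(0xE0 | (CP >> 12));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  } else {
    Out += static_cast<char>(0xF0 | (CP >> 18));
    Out += static_cast<char>(0x80 | ((CP >> 12) & 0x3F));
    Out += static_cast<char>(0x80 | ((CP >> 6) & 0x3F));
    Out += static_cast<char>(0x80 | (CP & 0x3F));
  }
}

// Quote a single code point, escaping it if it would otherwise attach to the
// opening quote when displayed.
std::string quoteCodePoint(char32_t CP) {
  std::string Out = "'";
  if (isCombiningDiacritic(CP)) {
    char Buf[16];
    std::snprintf(Buf, sizeof(Buf), "\\u{%X}", static_cast<unsigned>(CP));
    Out += Buf;
  } else {
    appendUTF8(Out, CP);
  }
  Out += "'";
  return Out;
}
```

With this approach, U+0335 would be shown as `'\u{335}'` instead of a quote 
with a stroke through it, while U+3086 would still be shown as `'ゆ'`.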


CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D155610/new/

https://reviews.llvm.org/D155610

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
