https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91755
Bug ID: 91755
Summary: C++ handling of extended characters is not 100% correct
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: lhyatt at gmail dot com
Target Milestone: ---

In C++, technically extended characters (e.g. UTF-8) in the source are supposed to be converted to UCN escapes during translation phase 1. Thereafter it should not be detectable whether a UCN or the character itself was used (except in raw string literals where the conversion is reverted). GCC does not do this transformation. The distinction is not visible in too many places, but one such is in preprocessor stringizing. For instance:

==========
#define stringize(x) #x
static_assert(sizeof(stringize("π")) == sizeof(stringize("\U000003C0")), "oops");
==========

The above assert should not fire per the letter of the standard, but it does.

I am not sure if it is necessarily desirable to fix this since the existing behavior seems more intuitive and matches other compilers. But the issue may become a little more prevalent soon -- as discussed in this thread: https://gcc.gnu.org/ml/gcc-patches/2019-09/msg00822.html, a patch will be applied in the near future that enables extended characters in identifiers too. Similar to the above case, stringizing such an identifier twice will also make visible the distinction between UCN- and direct-specified extended characters.

In the new tests being added for this patch (gcc/testsuite/g++.dg/cpp/ucnid-2-utf8.C and gcc/testsuite/g++.dg/cpp/ucnid-3-utf8.C), we test that stringizing works for identifiers containing extended characters, but we test the existing behavior, which is technically not standard-conforming. So in order to memorialize the state of things, I am filing this bug report so that I can add a reference to the situation in the new test cases. If GCC behavior changes in the future, these new tests will fail and should be adapted to match.
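To see why the static_assert in the report fires, it may help to look at the two sizes separately instead of comparing them. The following is only an illustrative sketch: the names kDirectSize, kUcnSize and kSpellingsDiffer are invented for this example, and it assumes the file is saved as UTF-8 and compiled with a UTF-8 execution character set.

```cpp
// Illustrative sketch only; the constant names below are hypothetical.
// Assumes UTF-8 source encoding and a UTF-8 execution character set.
#include <cstddef>

#define stringize(x) #x

// Literal spelled with the character itself. With the behavior described in
// the report, the stringized text keeps the raw UTF-8 bytes between the
// escaped quotes.
constexpr std::size_t kDirectSize = sizeof(stringize("π"));

// Literal spelled as a UCN. Stringizing escapes the quotes and the backslash,
// so the spelling of the escape sequence survives as ordinary characters:
// quote, backslash, 'U', eight hex digits, quote, plus the terminating NUL.
constexpr std::size_t kUcnSize = sizeof(stringize("\U000003C0"));

// True when the two spellings can be told apart after stringizing.
constexpr bool kSpellingsDiffer = (kDirectSize != kUcnSize);
```

Under those assumptions, a compiler that behaves as described in this report should give kDirectSize == 5 and kUcnSize == 13, so kSpellingsDiffer is true and the equality assert fails. An implementation that performed the phase 1 conversion would be expected to produce equal sizes instead.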