https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224
--- Comment #19 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:

> which illustrates that g++ does not process trigraphs inside raw string
> literals. Admittedly I'm looking at the draft standard, but I don't think
> this

As stated in [lex.pptoken] in both C++11 and C++14: "Between the initial
and final double quote characters of the raw string, any transformations
performed in phases 1 and 2 (trigraphs, universal-character-names, and
line splicing) are reverted; this reversion shall apply before any
d-char, r-char, or delimiting parenthesis is identified.".  Yes, the
positioning of this in the standard may be confusing....  That is, the
effect is more or less as if trigraphs weren't processed inside raw
strings (but the implementation involves undoing trigraph substitutions,
as described in the standard).

I think the right way to implement UTF-8 in identifiers involves making
lex_identifier handle UTF-8 (when extended identifiers are enabled), and
making _cpp_lex_direct handle bytes with the high bit set as
potentially[*] starting identifiers (requiring the same handling of
normalization state as for the other cases of characters starting
identifiers, of course).  If you do that, then raw strings and all the
corner cases of spelling preservation fall out naturally (though they
still need testcases added to the testsuite).

[*] I think the right rule for C is that UTF-8 for a character not
allowed in identifiers should produce a preprocessing token on its own
rather than an error for an invalid character in an identifier (and
similarly, such a character after the start of the identifier should
terminate the identifier and produce such a preprocessing token).
Unless and until someone implements the C++ phase 1 conversion to UCNs,
it would seem reasonable to follow this rule for C++ as well.
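
(Illustration, not from the original comment: a small self-contained test
of the [lex.pptoken] reversion quoted above, assuming g++ in a mode where
trigraphs are enabled, e.g. -std=c++11.  The ??= trigraph is replaced by
# in the ordinary literal, but the phase-1 replacement is reverted inside
the raw literal.)

// Expected output: "# ??=" -- the trigraph is converted in the ordinary
// string but the raw string keeps the original ??= spelling.
#include <cstdio>

int main()
{
  const char *ordinary = "??=";   // trigraph replaced by # when trigraphs are enabled
  const char *raw = R"(??=)";     // phase-1 trigraph replacement reverted here
  std::printf("%s %s\n", ordinary, raw);
  return 0;
}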
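
(Also not from the original comment: a hypothetical, heavily simplified
sketch of the tokenization rule proposed in the [*] footnote, i.e. a
non-identifier extended character terminates the current identifier and
becomes a preprocessing token of its own.  The names decode_utf8 and
is_identifier_char are made up for illustration and are not libcpp
interfaces, and the "allowed in identifiers" test is a stand-in for the
real ranges the preprocessor would have to check.)

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Decode one UTF-8 code point starting at s[i]; advance i past it.
// (No error handling: assumes well-formed UTF-8 for brevity.)
static uint32_t decode_utf8(const std::string &s, size_t &i)
{
  unsigned char c = s[i++];
  if (c < 0x80)
    return c;
  int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
  uint32_t cp = c & (0x3F >> extra);
  while (extra-- > 0)
    cp = (cp << 6) | (s[i++] & 0x3F);
  return cp;
}

// Grossly simplified "allowed in identifiers" predicate: ASCII letters,
// digits, underscore, plus a Latin-1 range purely for illustration.
static bool is_identifier_char(uint32_t cp)
{
  return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
         || (cp >= '0' && cp <= '9') || cp == '_'
         || (cp >= 0x00C0 && cp <= 0x00FF);
}

int main()
{
  // "a->b" written with U+2192 (RIGHTWARDS ARROW), which is not an
  // identifier character; whitespace and other lexer concerns omitted.
  std::string input = "a\xE2\x86\x92" "b";
  std::vector<std::string> tokens;

  for (size_t i = 0; i < input.size(); )
    {
      size_t start = i;
      uint32_t cp = decode_utf8(input, i);
      if (is_identifier_char(cp))
        {
          // Extend the identifier as far as allowed characters go.
          while (i < input.size())
            {
              size_t next = i;
              if (!is_identifier_char(decode_utf8(input, next)))
                break;
              i = next;
            }
        }
      // A disallowed character falls through here immediately, ending any
      // preceding identifier and forming a one-character token by itself
      // rather than being diagnosed as invalid in an identifier.
      tokens.push_back(input.substr(start, i - start));
    }

  for (const std::string &t : tokens)
    std::cout << "token: " << t << '\n';   // prints a, the arrow, b
  return 0;
}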