https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224
--- Comment #19 from joseph at codesourcery dot com <joseph at codesourcery dot com> ---
On Tue, 18 Aug 2015, ejolson at unr dot edu wrote:

> which illustrates that g++ does not process trigraphs inside raw string
> literals. Admittedly I'm looking at the draft standard, but I don't think
> this

As stated in [lex.pptoken] in both C++11 and C++14: "Between the initial
and final double quote characters of the raw string, any transformations
performed in phases 1 and 2 (trigraphs, universal-character-names, and
line splicing) are reverted; this reversion shall apply before any
d-char, r-char, or delimiting parenthesis is identified.".  Yes, the
positioning of this in the standard may be confusing....  That is, the
effect is more or less as if trigraphs weren't processed inside raw
strings (but the implementation involves undoing trigraph substitutions,
as described in the standard).

I think the right way to implement UTF-8 in identifiers involves making
lex_identifier handle UTF-8 (when extended identifiers are enabled), and
making _cpp_lex_direct handle bytes with the high bit set as
potentially[*] starting identifiers (requiring the same handling of
normalization state as for the other cases of characters starting
identifiers, of course).  If you do that, then raw strings and all the
corner cases of spelling preservation fall out naturally (though they
still need testcases added to the testsuite).

[*] I think the right rule for C is that UTF-8 for a character not
allowed in identifiers should produce a preprocessing token on its own
rather than an error for an invalid character in an identifier (and
similarly, such a character after the start of the identifier should
terminate the identifier and produce such a preprocessing token).
Unless and until someone implements the C++ phase 1 conversion to UCNs,
it would seem reasonable to follow this rule for C++ as well.
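
(Illustration, not from the original comment: a small self-contained test
of the [lex.pptoken] reversion quoted above, assuming g++ in a mode where
trigraphs are enabled, e.g. -std=c++11.  The ??= trigraph is replaced by
# in the ordinary literal, but the phase-1 replacement is reverted inside
the raw literal.)

// Expected output: "# ??=" -- the trigraph is converted in the ordinary
// string but the raw string keeps the original ??= spelling.
#include <cstdio>

int main()
{
  const char *ordinary = "??=";   // trigraph replaced by # when trigraphs are enabled
  const char *raw = R"(??=)";     // phase-1 trigraph replacement reverted here
  std::printf("%s %s\n", ordinary, raw);
  return 0;
}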
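
(Also not from the original comment: a hypothetical, heavily simplified
sketch of the tokenization rule proposed in the [*] footnote, i.e. a
non-identifier extended character terminates the current identifier and
becomes a preprocessing token of its own.  The names decode_utf8 and
is_identifier_char are made up for illustration and are not libcpp
interfaces, and the "allowed in identifiers" test is a stand-in for the
real ranges the preprocessor would have to check.)

#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

// Decode one UTF-8 code point starting at s[i]; advance i past it.
// (No error handling: assumes well-formed UTF-8 for brevity.)
static uint32_t decode_utf8(const std::string &s, size_t &i)
{
  unsigned char c = s[i++];
  if (c < 0x80)
    return c;
  int extra = (c >= 0xF0) ? 3 : (c >= 0xE0) ? 2 : 1;
  uint32_t cp = c & (0x3F >> extra);
  while (extra-- > 0)
    cp = (cp << 6) | (s[i++] & 0x3F);
  return cp;
}

// Grossly simplified "allowed in identifiers" predicate: ASCII letters,
// digits, underscore, plus a Latin-1 range purely for illustration.
static bool is_identifier_char(uint32_t cp)
{
  return (cp >= 'a' && cp <= 'z') || (cp >= 'A' && cp <= 'Z')
         || (cp >= '0' && cp <= '9') || cp == '_'
         || (cp >= 0x00C0 && cp <= 0x00FF);
}

int main()
{
  // "a->b" written with U+2192 (RIGHTWARDS ARROW), which is not an
  // identifier character; whitespace and other lexer concerns omitted.
  std::string input = "a\xE2\x86\x92" "b";
  std::vector<std::string> tokens;

  for (size_t i = 0; i < input.size(); )
    {
      size_t start = i;
      uint32_t cp = decode_utf8(input, i);
      if (is_identifier_char(cp))
        {
          // Extend the identifier as far as allowed characters go.
          while (i < input.size())
            {
              size_t next = i;
              if (!is_identifier_char(decode_utf8(input, next)))
                break;
              i = next;
            }
        }
      // A disallowed character falls through here immediately, ending any
      // preceding identifier and forming a one-character token by itself
      // rather than being diagnosed as invalid in an identifier.
      tokens.push_back(input.substr(start, i - start));
    }

  for (const std::string &t : tokens)
    std::cout << "token: " << t << '\n';   // prints a, the arrow, b
  return 0;
}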