May I please ping this one? Thanks... https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html
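For anyone skimming the thread, here is a rough illustration of the problem the quoted patch addresses. It is a toy sketch, not libcpp code: the helper names (is_id_start, is_id_continue, scan_suffix_ascii, scan_suffix_utf8) are hypothetical, and treating every byte >= 0x80 as an identifier character is a simplifying assumption. The real lexer validates extended characters and UCNs through forms_identifier_p () and the normalization machinery.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical ASCII-only stand-ins for the ISIDST / ISIDNUM checks.
static bool is_id_start(unsigned char c) {
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

static bool is_id_continue(unsigned char c) {
  return is_id_start(c) || (c >= '0' && c <= '9');
}

// ASCII-only suffix scan, in the spirit of the one-off loop described in the
// patch summary. Returns the number of bytes accepted starting at POS.
static std::size_t scan_suffix_ascii(const std::string &s, std::size_t pos) {
  std::size_t cur = pos;
  if (cur < s.size() && is_id_start(s[cur])) {
    ++cur;
    while (cur < s.size() && is_id_continue(s[cur]))
      ++cur;
  }
  return cur - pos;
}

// Simplified "extended" scan. ASSUMPTION: any byte >= 0x80 is accepted as
// part of an identifier; this does not check that the bytes form a valid
// UTF-8 sequence or that the code point is allowed in an identifier.
static std::size_t scan_suffix_utf8(const std::string &s, std::size_t pos) {
  auto extended = [](unsigned char c) { return c >= 0x80; };
  std::size_t cur = pos;
  if (cur < s.size() && (is_id_start(s[cur]) || extended(s[cur]))) {
    ++cur;
    while (cur < s.size() && (is_id_continue(s[cur]) || extended(s[cur])))
      ++cur;
  }
  return cur - pos;
}

// Small self-check: the suffix in `""_π (x)` starts at byte offset 2, and
// π is encoded in UTF-8 as the two bytes 0xCF 0x80.
static void check_examples() {
  const std::string src = "\"\"_\xCF\x80 (x)";
  assert(scan_suffix_ascii(src, 2) == 1);  // stops after '_', π is left over
  assert(scan_suffix_utf8(src, 2) == 3);   // '_' plus the two bytes of π
}
```

The ASCII-only scan accepts just the underscore and leaves the bytes of π behind, which matches the symptom reported in the PR; a scan that is aware of extended characters consumes the whole suffix.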
On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt <lhy...@gmail.com> wrote:
>
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
>     bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
>         PR preprocessor/103902
>         * lex.cc (identifier_diagnostics_on_lex): New function refactoring
>         some common code.
>         (lex_identifier_intern): Use the new function.
>         (lex_identifier): Don't run identifier diagnostics here, rather let
>         the call site do it when needed.
>         (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
>         accordingly.
>         (struct scan_id_result): New struct.
>         (scan_cur_identifier): New function.
>         (create_literal2): New function.
>         (lit_accum::create_literal2): New function.
>         (is_macro): Folded into new function...
>         (maybe_ignore_udl_macro_suffix): ...here.
>         (is_macro_not_literal_suffix): Folded likewise.
>         (lex_raw_string): Handle UTF-8 in UDL suffix via scan_cur_identifier ().
>         (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>         PR preprocessor/103902
>         * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
>     Hello-
>
>     This is the updated version of the patch, incorporating feedback from Jakub
>     and Jason, most recently discussed here:
>
>     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
>     Please let me know how it looks? It is simpler than before with the new
>     approach. Thanks!
>
>     One thing to note, as Jason clarified for me, concerns a usage like this:
>
>      #pragma GCC poison _x
>      const char * operator "" _x (const char *, unsigned long);
>
>     The space between the "" and the _x is currently allowed but will be
>     deprecated in C++23. GCC currently will complain about the poisoned use of
>     _x in this case, and this patch, which is just focused on handling UTF-8
>     properly, does not change this. But it seems that it would be correct
>     not to apply poison in this case. I can try to follow up with a patch to do
>     so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
>     worth it...
>
>     For the time being, this patch does add a testcase for the above and xfails
>     it. For the case where no space is present, which is the part touched by the
>     present patch, existing behavior is preserved correctly and no diagnostics
>     such as poison are issued for the UDL suffix. (Contrary to v1 of this
>     patch.)
>
>     Thanks! bootstrap + regtested all languages on x86-64 Linux with
>     no regressions.
> > -Lewis > > .../g++.dg/cpp0x/udlit-extended-id-1.C | 68 ++++ > .../g++.dg/cpp0x/udlit-extended-id-2.C | 6 + > .../g++.dg/cpp0x/udlit-extended-id-3.C | 15 + > .../g++.dg/cpp0x/udlit-extended-id-4.C | 14 + > libcpp/lex.cc | 382 ++++++++++-------- > 5 files changed, 317 insertions(+), 168 deletions(-) > create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C > create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C > create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C > create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C > > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C > b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C > new file mode 100644 > index 00000000000..411d4fdd0ba > --- /dev/null > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C > @@ -0,0 +1,68 @@ > +// { dg-do run { target c++11 } } > +// { dg-additional-options "-Wno-error=normalized" } > +#include <cstring> > +using namespace std; > + > +constexpr unsigned long long operator "" _π (unsigned long long x) > +{ > + return 3 * x; > +} > + > +/* Historically we didn't parse properly as part of the "" token, so check > that > + as well. 
*/ > +constexpr unsigned long long operator ""_Π2 (unsigned long long x) > +{ > + return 4 * x; > +} > + > +char x1[1_π]; > +char x2[2_Π2]; > + > +static_assert (sizeof x1 == 3, "test1"); > +static_assert (sizeof x2 == 8, "test2"); > + > +const char * operator "" _1σ (const char *s, unsigned long) > +{ > + return s + 1; > +} > + > +const char * operator ""_Σ2 (const char *s, unsigned long) > +{ > + return s + 2; > +} > + > +const char * operator "" _\U000000e61 (const char *s, unsigned long) > +{ > + return "ae"; > +} > + > +const char* operator ""_\u01532 (const char *s, unsigned long) > +{ > + return "oe"; > +} > + > +bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in > NFC" } > +bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not > in NFC" } > + > +#define xτy > +const char * str = ""xτy; // { dg-warning "invalid suffix on literal" } > + > +int main() > +{ > + if (3_π != 9) > + __builtin_abort (); > + if (4_Π2 != 16) > + __builtin_abort (); > + if (strcmp ("abc"_1σ, "bc")) > + __builtin_abort (); > + if (strcmp ("abcd"_Σ2, "cd")) > + __builtin_abort (); > + if (strcmp (R"(abcdef)"_1σ, "bcdef")) > + __builtin_abort (); > + if (strcmp (R"(abcdef)"_Σ2, "cdef")) > + __builtin_abort (); > + if (strcmp ("xyz"_æ1, "ae")) > + __builtin_abort (); > + if (strcmp ("xyz"_œ2, "oe")) > + __builtin_abort (); > +} > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C > b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C > new file mode 100644 > index 00000000000..05a2804a463 > --- /dev/null > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C > @@ -0,0 +1,6 @@ > +// { dg-do compile { target c++11 } } > +// { dg-additional-options "-Wbidi-chars=any,ucn" } > +bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 } > +// { dg-error "universal character \\\\u202a is not valid in an identifier" > "test1" { target *-*-* } line1 } > +// { dg-error "universal character \\\\u202c is not valid in an 
identifier" > "test2" { target *-*-* } line1 } > +// { dg-warning "found problematic Unicode character" "test3" { target *-*-* > } line1 } > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C > b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C > new file mode 100644 > index 00000000000..11292e476e3 > --- /dev/null > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C > @@ -0,0 +1,15 @@ > +// { dg-do compile { target c++11 } } > + > +// Check that we do not look for poisoned identifier when it is a suffix. > +int _ħ; > +#pragma GCC poison _ħ > +const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus > "poisoned" } > +bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" } > +bool b = 1_ħ; // { dg-bogus "poisoned" } > +const char *x = "hbar"_ħ; // { dg-bogus "poisoned" } > + > +/* Ideally, we should not warn here either, but this is not implemented yet. > This > + syntax has been deprecated for C++23. */ > +#pragma GCC poison _ħ2 > +const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus > "poisoned" "" { xfail *-*-*} } > +const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" } > diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C > b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C > new file mode 100644 > index 00000000000..d1683c4d892 > --- /dev/null > +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C > @@ -0,0 +1,14 @@ > +// { dg-options "-std=c++98 -Wc++11-compat" } > +#define END ; > +#define εND ; > +#define EηD ; > +#define EN\u0394 ; > + > +const char *s1 = "s1"END // { dg-warning "requires a space between string > literal and macro" } > +const char *s2 = "s2"εND // { dg-warning "requires a space between string > literal and macro" } > +const char *s3 = "s3"EηD // { dg-warning "requires a space between string > literal and macro" } > +const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string > literal and macro" } > + > +/* Make sure we did not skip the token also 
in the case that it wasn't found > to > + be a macro; compilation should fail here. */ > +const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" > } > diff --git a/libcpp/lex.cc b/libcpp/lex.cc > index 45ea16a91bc..062935e2371 100644 > --- a/libcpp/lex.cc > +++ b/libcpp/lex.cc > @@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile, > } > } > > -/* Returns TRUE if the sequence starting at buffer->cur is valid in > - an identifier. FIRST is TRUE if this starts an identifier. */ > +/* Returns TRUE if the byte sequence starting at buffer->cur is a valid > + extended character in an identifier. If FIRST is TRUE, then the character > + must be valid at the beginning of an identifier as well. If the return > + value is TRUE, then pfile->buffer->cur has been moved to point to the next > + byte after the extended character. */ > > static bool > forms_identifier_p (cpp_reader *pfile, int first, > @@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile) > } > } > > +/* Helper function to perform diagnostics that are needed (rarely) > + when an identifier is lexed. */ > +static void > +identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node) > +{ > + if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC) > + || pfile->state.skipping, 1)) > + return; > + > + /* It is allowed to poison the same identifier twice. */ > + if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok) > + cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", > + NODE_NAME (node)); > + > + /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the > + replacement list of a variadic macro. 
*/ > + if (node == pfile->spec_nodes.n__VA_ARGS__ > + && !pfile->state.va_args_ok) > + { > + if (CPP_OPTION (pfile, cplusplus)) > + cpp_error (pfile, CPP_DL_PEDWARN, > + "__VA_ARGS__ can only appear in the expansion" > + " of a C++11 variadic macro"); > + else > + cpp_error (pfile, CPP_DL_PEDWARN, > + "__VA_ARGS__ can only appear in the expansion" > + " of a C99 variadic macro"); > + } > + > + /* __VA_OPT__ should only appear in the replacement list of a > + variadic macro. */ > + if (node == pfile->spec_nodes.n__VA_OPT__) > + maybe_va_opt_error (pfile); > + > + /* For -Wc++-compat, warn about use of C++ named operators. */ > + if (node->flags & NODE_WARN_OPERATOR) > + cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, > + "identifier \"%s\" is a special operator name in C++", > + NODE_NAME (node)); > +} > + > /* Helper function to get the cpp_hashnode of the identifier BASE. */ > static cpp_hashnode * > lex_identifier_intern (cpp_reader *pfile, const uchar *base) > @@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar > *base) > hash = HT_HASHFINISH (hash, len); > result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table, > base, len, hash, HT_ALLOC)); > - > - /* Rarely, identifiers require diagnostics when lexed. */ > - if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC) > - && !pfile->state.skipping, 0)) > - { > - /* It is allowed to poison the same identifier twice. */ > - if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok) > - cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", > - NODE_NAME (result)); > - > - /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the > - replacement list of a variadic macro. 
*/ > - if (result == pfile->spec_nodes.n__VA_ARGS__ > - && !pfile->state.va_args_ok) > - { > - if (CPP_OPTION (pfile, cplusplus)) > - cpp_error (pfile, CPP_DL_PEDWARN, > - "__VA_ARGS__ can only appear in the expansion" > - " of a C++11 variadic macro"); > - else > - cpp_error (pfile, CPP_DL_PEDWARN, > - "__VA_ARGS__ can only appear in the expansion" > - " of a C99 variadic macro"); > - } > - > - if (result == pfile->spec_nodes.n__VA_OPT__) > - maybe_va_opt_error (pfile); > - > - /* For -Wc++-compat, warn about use of C++ named operators. */ > - if (result->flags & NODE_WARN_OPERATOR) > - cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, > - "identifier \"%s\" is a special operator name in C++", > - NODE_NAME (result)); > - } > - > + identifier_diagnostics_on_lex (pfile, result); > return result; > } > > @@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const char > *name) > return result; > } > > -/* Lex an identifier starting at BUFFER->CUR - 1. */ > +/* Lex an identifier starting at BASE. BUFFER->CUR is expected to point > + one past the first character at BASE, which may be a (possibly multi-byte) > + character if STARTS_UCN is true. */ > static cpp_hashnode * > lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn, > struct normalize_state *nst, cpp_hashnode **spelling) > @@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const uchar *base, > bool starts_ucn, > *spelling = result; > } > > - /* Rarely, identifiers require diagnostics when lexed. */ > - if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC) > - && !pfile->state.skipping, 0)) > - { > - /* It is allowed to poison the same identifier twice. */ > - if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok) > - cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"", > - NODE_NAME (result)); > - > - /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the > - replacement list of a variadic macro. 
*/ > - if (result == pfile->spec_nodes.n__VA_ARGS__ > - && !pfile->state.va_args_ok) > - { > - if (CPP_OPTION (pfile, cplusplus)) > - cpp_error (pfile, CPP_DL_PEDWARN, > - "__VA_ARGS__ can only appear in the expansion" > - " of a C++11 variadic macro"); > - else > - cpp_error (pfile, CPP_DL_PEDWARN, > - "__VA_ARGS__ can only appear in the expansion" > - " of a C99 variadic macro"); > - } > + return result; > +} > > - /* __VA_OPT__ should only appear in the replacement list of a > - variadic macro. */ > - if (result == pfile->spec_nodes.n__VA_OPT__) > - maybe_va_opt_error (pfile); > - > - /* For -Wc++-compat, warn about use of C++ named operators. */ > - if (result->flags & NODE_WARN_OPERATOR) > - cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES, > - "identifier \"%s\" is a special operator name in C++", > - NODE_NAME (result)); > - } > +/* Struct to hold the return value of the scan_cur_identifier () helper > + function below. */ > > +struct scan_id_result > +{ > + cpp_hashnode *node; > + normalize_state nst; > + > + scan_id_result () > + : node (nullptr) > + { > + nst = INITIAL_NORMALIZE_STATE; > + } > + > + explicit operator bool () const { return node; } > +}; > + > +/* Helper function to scan an entire identifier beginning at > + pfile->buffer->cur, and possibly containing extended characters (UCNs > + and/or UTF-8). Returns the cpp_hashnode for the identifier on success, or > + else nullptr, as well as a normalize_state so that normalization warnings > + may be issued once the token lexing is complete. 
*/ > + > +static scan_id_result > +scan_cur_identifier (cpp_reader *pfile) > +{ > + const auto buffer = pfile->buffer; > + const auto begin = buffer->cur; > + scan_id_result result; > + if (ISIDST (*buffer->cur)) > + { > + ++buffer->cur; > + cpp_hashnode *ignore; > + result.node = lex_identifier (pfile, begin, false, &result.nst, > &ignore); > + } > + else if (forms_identifier_p (pfile, true, &result.nst)) > + { > + /* buffer->cur has been moved already by the call > + to forms_identifier_p. */ > + cpp_hashnode *ignore; > + result.node = lex_identifier (pfile, begin, true, &result.nst, > &ignore); > + } > return result; > } > > @@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, > const uchar *base, > token->val.str.text = cpp_alloc_token_string (pfile, base, len); > } > > +/* Like create_literal(), but construct it from two separate strings > + which are concatenated. LEN2 may be 0 if no second string is > + required. */ > +static void > +create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1, > + unsigned int len1, const uchar *base2, unsigned int len2, > + enum cpp_ttype type) > +{ > + token->type = type; > + token->val.str.len = len1 + len2; > + uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1); > + memcpy (dest, base1, len1); > + if (len2) > + memcpy (dest+len1, base2, len2); > + dest[len1 + len2] = 0; > + token->val.str.text = dest; > +} > + > const uchar * > cpp_alloc_token_string (cpp_reader *pfile, > const unsigned char *ptr, unsigned len) > @@ -2403,6 +2442,11 @@ struct lit_accum { > rpos = NULL; > return c; > } > + > + void create_literal2 (cpp_reader *pfile, cpp_token *token, > + const uchar *base1, unsigned int len1, > + const uchar *base2, unsigned int len2, > + enum cpp_ttype type); > }; > > /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer > @@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile) > rpos = BUFF_FRONT (last); > } > > -/* Returns true if a macro 
has been defined. > - This might not work if compile with -save-temps, > - or preprocess separately from compilation. */ > +/* Helper function to check if a string format macro, say from inttypes.h, is > + placed touching a string literal, in which case it could be parsed as a > C++11 > + user-defined string literal thus breaking the program. User-defined > literals > + outside of namespace std must start with a single underscore, so assume > + anything of that form really is a UDL suffix. We don't need to worry > about > + UDLs defined inside namespace std because their names are reserved, so > cannot > + be used as macro names in valid programs. Return TRUE if the UDL should > be > + ignored for now and preserved for potential macro expansion. */ > > static bool > -is_macro(cpp_reader *pfile, const uchar *base) > +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc, > + const uchar *suffix_begin, cpp_hashnode *node) > { > - const uchar *cur = base; > - if (! ISIDST (*cur)) > + if ((suffix_begin[0] == '_' && suffix_begin[1] != '_') > + || !cpp_macro_p (node)) > return false; > - unsigned int hash = HT_HASHSTEP (0, *cur); > - ++cur; > - while (ISIDNUM (*cur)) > - { > - hash = HT_HASHSTEP (hash, *cur); > - ++cur; > - } > - hash = HT_HASHFINISH (hash, cur - base); > > - cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash > (pfile->hash_table, > - base, cur - base, hash, > HT_NO_INSERT)); > - > - return result && cpp_macro_p (result); > + /* Maybe raise a warning here; caller should arrange not to consume > + the tokens. */ > + if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping) > + cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0, > + "invalid suffix on literal; C++11 requires a space > " > + "between literal and string macro"); > + return true; > } > > -/* Returns true if a literal suffix does not have the expected form > - and is defined as a macro. 
*/ > - > -static bool > -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base) > +/* Like create_literal2(), but also prepend all the accumulated data from > + the lit_accum struct. */ > +void > +lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token, > + const uchar *base1, unsigned int len1, > + const uchar *base2, unsigned int len2, > + enum cpp_ttype type) > { > - /* User-defined literals outside of namespace std must start with a single > - underscore, so assume anything of that form really is a UDL suffix. > - We don't need to worry about UDLs defined inside namespace std because > - their names are reserved, so cannot be used as macro names in valid > - programs. */ > - if (base[0] == '_' && base[1] != '_') > - return false; > - return is_macro (pfile, base); > + const unsigned int tot_len = accum + len1 + len2; > + uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1); > + token->type = type; > + token->val.str.len = tot_len; > + token->val.str.text = dest; > + for (_cpp_buff *buf = first; buf; buf = buf->next) > + { > + size_t len = BUFF_FRONT (buf) - buf->base; > + memcpy (dest, buf->base, len); > + dest += len; > + } > + memcpy (dest, base1, len1); > + dest += len1; > + if (len2) > + memcpy (dest, base2, len2); > + dest += len2; > + *dest = '\0'; > } > > /* Lexes a raw string. The stored string contains the spelling, > @@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, > const uchar *base) > > if (CPP_OPTION (pfile, user_literals)) > { > - /* If a string format macro, say from inttypes.h, is placed touching > - a string literal it could be parsed as a C++11 user-defined string > - literal thus breaking the program. */ > - if (is_macro_not_literal_suffix (pfile, pos)) > - { > - /* Raise a warning, but do not consume subsequent tokens. 
*/ > - if (CPP_OPTION (pfile, warn_literal_suffix) && > !pfile->state.skipping) > - cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, > - token->src_loc, 0, > - "invalid suffix on literal; C++11 requires > " > - "a space between literal and string > macro"); > - } > - /* Grab user defined literal suffix. */ > - else if (ISIDST (*pos)) > - { > - type = cpp_userdef_string_add_type (type); > - ++pos; > + const uchar *const suffix_begin = pos; > + pfile->buffer->cur = pos; > > - while (ISIDNUM (*pos)) > - ++pos; > + if (const auto sr = scan_cur_identifier (pfile)) > + { > + if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc, > + suffix_begin, sr.node)) > + pfile->buffer->cur = suffix_begin; > + else > + { > + type = cpp_userdef_string_add_type (type); > + accum.create_literal2 (pfile, token, base, suffix_begin - base, > + NODE_NAME (sr.node), NODE_LEN (sr.node), > + type); > + if (accum.first) > + _cpp_release_buff (pfile, accum.first); > + warn_about_normalization (pfile, token, &sr.nst, true); > + return; > + } > } > } > > @@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, > const uchar *base) > create_literal (pfile, token, base, pos - base, type); > else > { > - size_t extra_len = pos - base; > - uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + > 1); > - > - token->type = type; > - token->val.str.len = accum.accum + extra_len; > - token->val.str.text = dest; > - for (_cpp_buff *buf = accum.first; buf; buf = buf->next) > - { > - size_t len = BUFF_FRONT (buf) - buf->base; > - memcpy (dest, buf->base, len); > - dest += len; > - } > + accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, > type); > _cpp_release_buff (pfile, accum.first); > - memcpy (dest, base, extra_len); > - dest[extra_len] = '\0'; > } > } > > @@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, > const uchar *base) > cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character", > (int) terminator); > > + 
pfile->buffer->cur = cur; > + const uchar *const suffix_begin = cur; > + > if (CPP_OPTION (pfile, user_literals)) > { > - /* If a string format macro, say from inttypes.h, is placed touching > - a string literal it could be parsed as a C++11 user-defined string > - literal thus breaking the program. */ > - if (is_macro_not_literal_suffix (pfile, cur)) > - { > - /* Raise a warning, but do not consume subsequent tokens. */ > - if (CPP_OPTION (pfile, warn_literal_suffix) && > !pfile->state.skipping) > - cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, > - token->src_loc, 0, > - "invalid suffix on literal; C++11 requires > " > - "a space between literal and string > macro"); > - } > - /* Grab user defined literal suffix. */ > - else if (ISIDST (*cur)) > + if (const auto sr = scan_cur_identifier (pfile)) > { > - type = cpp_userdef_char_add_type (type); > - type = cpp_userdef_string_add_type (type); > - ++cur; > - > - while (ISIDNUM (*cur)) > - ++cur; > + if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc, > + suffix_begin, sr.node)) > + pfile->buffer->cur = suffix_begin; > + else > + { > + /* Grab user defined literal suffix. */ > + type = cpp_userdef_char_add_type (type); > + type = cpp_userdef_string_add_type (type); > + create_literal2 (pfile, token, base, suffix_begin - base, > + NODE_NAME (sr.node), NODE_LEN (sr.node), type); > + warn_about_normalization (pfile, token, &sr.nst, true); > + return; > + } > } > } > else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat) > - && is_macro (pfile, cur) > && !pfile->state.skipping) > - cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT, > - token->src_loc, 0, "C++11 requires a space " > - "between string literal and macro"); > + { > + const auto sr = scan_cur_identifier (pfile); > + /* Maybe raise a warning, but do not consume the tokens. 
*/ > + pfile->buffer->cur = suffix_begin; > + if (sr && cpp_macro_p (sr.node)) > + cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT, > + token->src_loc, 0, "C++11 requires a space " > + "between string literal and macro"); > + } > > - pfile->buffer->cur = cur; > create_literal (pfile, token, base, cur - base, type); > } > > @@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile) > result->type = CPP_NAME; > { > struct normalize_state nst = INITIAL_NORMALIZE_STATE; > - result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false, > - &nst, > - &result->val.node.spelling); > + const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst, > + &result->val.node.spelling); > + result->val.node.node = node; > + identifier_diagnostics_on_lex (pfile, node); > warn_about_normalization (pfile, result, &nst, true); > } > > @@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile) > if (forms_identifier_p (pfile, true, &nst)) > { > result->type = CPP_NAME; > - result->val.node.node = lex_identifier (pfile, base, true, &nst, > - > &result->val.node.spelling); > + const auto node = lex_identifier (pfile, base, true, &nst, > + &result->val.node.spelling); > + result->val.node.node = node; > + identifier_diagnostics_on_lex (pfile, node); > warn_about_normalization (pfile, result, &nst, true); > break; > } > @@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type) > } > > /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER. > - The buffer must already contain the enough space to hold the > + The buffer must already contain enough space to hold the > token's spelling. Returns a pointer to the character after the > last character written. */ > unsigned char * > @@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, > cpp_hashnode *ident) > } > > /* Write the spelling of a token TOKEN to BUFFER. The buffer must > - already contain the enough space to hold the token's spelling. 
> + already contain enough space to hold the token's spelling. > Returns a pointer to the character after the last character written. > FORSTRING is true if this is to be the spelling after translation > phase 1 (with the original spelling of extended identifiers), false