Ping: [PATCH v2] libcpp: Handle extended characters in user-defined literal suffix [PR103902]

Lewis Hyatt via Gcc-patches Tue, 02 May 2023 06:27:29 -0700

May I please ping this one? Thanks...
https://gcc.gnu.org/pipermail/gcc-patches/2023-March/613247.html


On Thu, Mar 2, 2023 at 6:21 PM Lewis Hyatt <lhy...@gmail.com> wrote:
>
> The PR complains that we do not handle UTF-8 in the suffix for a user-defined
> literal, such as:
>
> bool operator ""_π (unsigned long long);
>
> In fact we don't handle any extended identifier characters there, whether
> UTF-8, UCNs, or the $ sign. We do handle it fine if the optional space after
> the "" tokens is included, since then the identifier is lexed in the "normal"
> way as its own token. But when it is lexed as part of the string token, this
> is handled in lex_string() with a one-off loop that is not aware of extended
> characters.
>
> This patch fixes it by adding a new function scan_cur_identifier() that can be
> used to lex an identifier while in the middle of lexing another token.
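To illustrate the difference, here is a minimal standalone sketch (not libcpp code) comparing an ASCII-only suffix scan, similar in spirit to the old one-off loop, with a scan that also accepts UTF-8 bytes. All function names are hypothetical. Treating every byte >= 0x80 as a valid identifier character is a simplifying assumption; the real lexer validates each extended character via forms_identifier_p () and also handles UCNs and the $ sign.

```cpp
#include <cstddef>

// Hypothetical helper: view a char as an unsigned byte.
constexpr unsigned char uc(char c) { return static_cast<unsigned char>(c); }

// ASCII-only identifier tests, in the spirit of ISIDST / ISIDNUM.
constexpr bool is_id_start(unsigned char c)
{
  return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') || c == '_';
}

constexpr bool is_id_cont(unsigned char c)
{
  return is_id_start(c) || (c >= '0' && c <= '9');
}

// ASSUMPTION: any byte >= 0x80 is part of a UTF-8 sequence that is valid
// in an identifier.  A real lexer must decode and validate the character.
constexpr bool is_utf8_byte(unsigned char c) { return c >= 0x80; }

// Length in bytes of the suffix at S when only ASCII characters count.
constexpr std::size_t ascii_suffix_len(const char *s)
{
  if (!is_id_start(uc(s[0])))
    return 0;
  std::size_t n = 1;
  while (is_id_cont(uc(s[n])))
    ++n;
  return n;
}

// Length in bytes of the suffix at S when UTF-8 bytes are accepted too.
constexpr std::size_t extended_suffix_len(const char *s)
{
  if (!is_id_start(uc(s[0])) && !is_utf8_byte(uc(s[0])))
    return 0;
  std::size_t n = 1;
  while (is_id_cont(uc(s[n])) || is_utf8_byte(uc(s[n])))
    ++n;
  return n;
}
```

For the input `_π (`, with π encoded as the two bytes 0xCF 0x80, the ASCII-only scan stops after the underscore, while the extended scan covers all three bytes of the suffix.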
>
> BTW, the other place that has been mis-lexing identifiers is
> lex_identifier_intern(), which is used to implement #pragma push_macro
> and #pragma pop_macro. This does not support extended characters either.
> I will add that in a subsequent patch, because it can't directly reuse the
> new function, but rather needs to lex from a string instead of a cpp_buffer.
>
> With scan_cur_identifier(), we do also correctly warn about bidi and
> normalization issues in the extended identifiers comprising the suffix.
>
> libcpp/ChangeLog:
>
>         PR preprocessor/103902
>         * lex.cc (identifier_diagnostics_on_lex): New function refactoring
>         some common code.
>         (lex_identifier_intern): Use the new function.
>         (lex_identifier): Don't run identifier diagnostics here, rather let
>         the call site do it when needed.
>         (_cpp_lex_direct): Adjust the call sites of lex_identifier ()
>         accordingly.
>         (struct scan_id_result): New struct.
>         (scan_cur_identifier): New function.
>         (create_literal2): New function.
>         (lit_accum::create_literal2): New function.
>         (is_macro): Folded into new function...
>         (maybe_ignore_udl_macro_suffix): ...here.
>         (is_macro_not_literal_suffix): Folded likewise.
>         (lex_raw_string): Handle UTF-8 in UDL suffix via
>         scan_cur_identifier ().
>         (lex_string): Likewise.
>
> gcc/testsuite/ChangeLog:
>
>         PR preprocessor/103902
>         * g++.dg/cpp0x/udlit-extended-id-1.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-2.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-3.C: New test.
>         * g++.dg/cpp0x/udlit-extended-id-4.C: New test.
> ---
>
> Notes:
>     Hello-
>
>     This is the updated version of the patch, incorporating feedback from Jakub
>     and Jason, most recently discussed here:
>
>     https://gcc.gnu.org/pipermail/gcc-patches/2023-February/612073.html
>
>     Please let me know how it looks. The new approach makes it simpler than
>     before. Thanks!
>
>     One thing to note. As Jason clarified for me, a usage like this:
>
>     #pragma GCC poison _x
>     const char * operator "" _x (const char *, unsigned long);
>
>     The space between the "" and the _x is currently allowed but will be
>     deprecated in C++23. GCC currently will complain about the poisoned use of
>     _x in this case, and this patch, which is just focused on handling UTF-8
>     properly, does not change this. But it seems that it would be correct
>     not to apply poison in this case. I can try to follow up with a patch to do
>     so, if it seems worthwhile? Given the syntax is deprecated, maybe it's not
>     worth it...
>
>     For the time being, this patch does add a testcase for the above and xfails
>     it. For the case where no space is present, which is the part touched by the
>     present patch, existing behavior is preserved correctly and no diagnostics
>     such as poison are issued for the UDL suffix. (Contrary to v1 of this
>     patch.)
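For the no-space case, the choice between keeping the suffix as part of the literal and leaving it for macro expansion can be sketched as below. This is a simplified stand-in for the check in maybe_ignore_udl_macro_suffix () from the patch, not the libcpp code itself: the function name is hypothetical, and the `names_a_macro` parameter is an assumption that replaces the cpp_macro_p () lookup on the hash node.

```cpp
// Returns true if SUFFIX, which directly follows a string or character
// literal, should NOT be consumed as a user-defined literal suffix and
// should instead be left in the input for macro expansion.
//
// ASSUMPTION: NAMES_A_MACRO stands in for libcpp's cpp_macro_p () result.
constexpr bool keep_for_macro_expansion(const char *suffix, bool names_a_macro)
{
  // UDL suffixes outside namespace std start with a single underscore,
  // so anything of that form is taken to be a genuine UDL suffix.
  const bool udl_shaped = suffix[0] == '_' && suffix[1] != '_';
  return !udl_shaped && names_a_macro;
}
```

Under these assumptions a format macro such as PRId64 touching a string literal is preserved for expansion, while `_x` is always treated as a UDL suffix, even if a macro of that name exists.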
>
>     Thanks! Bootstrapped and regtested all languages on x86-64 Linux with
>     no regressions.
>
>     -Lewis
>
>  .../g++.dg/cpp0x/udlit-extended-id-1.C        |  68 ++++
>  .../g++.dg/cpp0x/udlit-extended-id-2.C        |   6 +
>  .../g++.dg/cpp0x/udlit-extended-id-3.C        |  15 +
>  .../g++.dg/cpp0x/udlit-extended-id-4.C        |  14 +
>  libcpp/lex.cc                                 | 382 ++++++++++--------
>  5 files changed, 317 insertions(+), 168 deletions(-)
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
>  create mode 100644 gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
>
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> new file mode 100644
> index 00000000000..411d4fdd0ba
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-1.C
> @@ -0,0 +1,68 @@
> +// { dg-do run { target c++11 } }
> +// { dg-additional-options "-Wno-error=normalized" }
> +#include <cstring>
> +using namespace std;
> +
> +constexpr unsigned long long operator "" _π (unsigned long long x)
> +{
> +  return 3 * x;
> +}
> +
> +/* Historically we didn't parse properly as part of the "" token, so check that
> +   as well.  */
> +constexpr unsigned long long operator ""_Π2 (unsigned long long x)
> +{
> +  return 4 * x;
> +}
> +
> +char x1[1_π];
> +char x2[2_Π2];
> +
> +static_assert (sizeof x1 == 3, "test1");
> +static_assert (sizeof x2 == 8, "test2");
> +
> +const char * operator "" _1σ (const char *s, unsigned long)
> +{
> +  return s + 1;
> +}
> +
> +const char * operator ""_Σ2 (const char *s, unsigned long)
> +{
> +  return s + 2;
> +}
> +
> +const char * operator "" _\U000000e61 (const char *s, unsigned long)
> +{
> +  return "ae";
> +}
> +
> +const char* operator ""_\u01532 (const char *s, unsigned long)
> +{
> +  return "oe";
> +}
> +
> +bool operator "" _\u0BC7\u0BBE (unsigned long long); // { dg-warning "not in NFC" }
> +bool operator ""_\u0B47\U00000B3E (unsigned long long); // { dg-warning "not in NFC" }
> +
> +#define xτy
> +const char * str = ""xτy; // { dg-warning "invalid suffix on literal" }
> +
> +int main()
> +{
> +  if (3_π != 9)
> +    __builtin_abort ();
> +  if (4_Π2 != 16)
> +    __builtin_abort ();
> +  if (strcmp ("abc"_1σ, "bc"))
> +    __builtin_abort ();
> +  if (strcmp ("abcd"_Σ2, "cd"))
> +    __builtin_abort ();
> +  if (strcmp (R"(abcdef)"_1σ, "bcdef"))
> +    __builtin_abort ();
> +  if (strcmp (R"(abcdef)"_Σ2, "cdef"))
> +    __builtin_abort ();
> +  if (strcmp ("xyz"_æ1, "ae"))
> +    __builtin_abort ();
> +  if (strcmp ("xyz"_œ2, "oe"))
> +    __builtin_abort ();
> +}
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> new file mode 100644
> index 00000000000..05a2804a463
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-2.C
> @@ -0,0 +1,6 @@
> +// { dg-do compile { target c++11 } }
> +// { dg-additional-options "-Wbidi-chars=any,ucn" }
> +bool operator ""_d\u202ae\u202cf (unsigned long long); // { dg-line line1 }
> +// { dg-error "universal character \\\\u202a is not valid in an identifier" "test1" { target *-*-* } line1 }
> +// { dg-error "universal character \\\\u202c is not valid in an identifier" "test2" { target *-*-* } line1 }
> +// { dg-warning "found problematic Unicode character" "test3" { target *-*-* } line1 }
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> new file mode 100644
> index 00000000000..11292e476e3
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-3.C
> @@ -0,0 +1,15 @@
> +// { dg-do compile { target c++11 } }
> +
> +// Check that we do not look for poisoned identifier when it is a suffix.
> +int _ħ;
> +#pragma GCC poison _ħ
> +const char * operator ""_ħ (const char *, unsigned long); // { dg-bogus "poisoned" }
> +bool operator ""_ħ (unsigned long long x); // { dg-bogus "poisoned" }
> +bool b = 1_ħ; // { dg-bogus "poisoned" }
> +const char *x = "hbar"_ħ; // { dg-bogus "poisoned" }
> +
> +/* Ideally, we should not warn here either, but this is not implemented yet.  This
> +   syntax has been deprecated for C++23.  */
> +#pragma GCC poison _ħ2
> +const char * operator "" _ħ2 (const char *, unsigned long); // { dg-bogus "poisoned" "" { xfail *-*-*} }
> +const char *x2 = "hbar2"_ħ2; // { dg-bogus "poisoned" }
> diff --git a/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> new file mode 100644
> index 00000000000..d1683c4d892
> --- /dev/null
> +++ b/gcc/testsuite/g++.dg/cpp0x/udlit-extended-id-4.C
> @@ -0,0 +1,14 @@
> +// { dg-options "-std=c++98 -Wc++11-compat" }
> +#define END ;
> +#define εND ;
> +#define EηD ;
> +#define EN\u0394 ;
> +
> +const char *s1 = "s1"END // { dg-warning "requires a space between string literal and macro" }
> +const char *s2 = "s2"εND // { dg-warning "requires a space between string literal and macro" }
> +const char *s3 = "s3"EηD // { dg-warning "requires a space between string literal and macro" }
> +const char *s4 = "s4"ENΔ // { dg-warning "requires a space between string literal and macro" }
> +
> +/* Make sure we did not skip the token also in the case that it wasn't found to
> +   be a macro; compilation should fail here.  */
> +const char *s5 = "s5"NØT_A_MACRO; // { dg-error "expected ',' or ';' before" }
> diff --git a/libcpp/lex.cc b/libcpp/lex.cc
> index 45ea16a91bc..062935e2371 100644
> --- a/libcpp/lex.cc
> +++ b/libcpp/lex.cc
> @@ -2057,8 +2057,11 @@ warn_about_normalization (cpp_reader *pfile,
>      }
>  }
>
> -/* Returns TRUE if the sequence starting at buffer->cur is valid in
> -   an identifier.  FIRST is TRUE if this starts an identifier.  */
> +/* Returns TRUE if the byte sequence starting at buffer->cur is a valid
> +   extended character in an identifier.  If FIRST is TRUE, then the character
> +   must be valid at the beginning of an identifier as well.  If the return
> +   value is TRUE, then pfile->buffer->cur has been moved to point to the next
> +   byte after the extended character.  */
>
>  static bool
>  forms_identifier_p (cpp_reader *pfile, int first,
> @@ -2154,6 +2157,47 @@ maybe_va_opt_error (cpp_reader *pfile)
>      }
>  }
>
> +/* Helper function to perform diagnostics that are needed (rarely)
> +   when an identifier is lexed.  */
> +static void
> +identifier_diagnostics_on_lex (cpp_reader *pfile, cpp_hashnode *node)
> +{
> +  if (__builtin_expect (!(node->flags & NODE_DIAGNOSTIC)
> +                       || pfile->state.skipping, 1))
> +    return;
> +
> +  /* It is allowed to poison the same identifier twice.  */
> +  if ((node->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> +    cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> +              NODE_NAME (node));
> +
> +  /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> +     replacement list of a variadic macro.  */
> +  if (node == pfile->spec_nodes.n__VA_ARGS__
> +      && !pfile->state.va_args_ok)
> +    {
> +      if (CPP_OPTION (pfile, cplusplus))
> +       cpp_error (pfile, CPP_DL_PEDWARN,
> +                  "__VA_ARGS__ can only appear in the expansion"
> +                  " of a C++11 variadic macro");
> +      else
> +       cpp_error (pfile, CPP_DL_PEDWARN,
> +                  "__VA_ARGS__ can only appear in the expansion"
> +                  " of a C99 variadic macro");
> +    }
> +
> +  /* __VA_OPT__ should only appear in the replacement list of a
> +     variadic macro.  */
> +  if (node == pfile->spec_nodes.n__VA_OPT__)
> +    maybe_va_opt_error (pfile);
> +
> +  /* For -Wc++-compat, warn about use of C++ named operators.  */
> +  if (node->flags & NODE_WARN_OPERATOR)
> +    cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> +                "identifier \"%s\" is a special operator name in C++",
> +                NODE_NAME (node));
> +}
> +
>  /* Helper function to get the cpp_hashnode of the identifier BASE.  */
>  static cpp_hashnode *
>  lex_identifier_intern (cpp_reader *pfile, const uchar *base)
> @@ -2173,41 +2217,7 @@ lex_identifier_intern (cpp_reader *pfile, const uchar *base)
>    hash = HT_HASHFINISH (hash, len);
>    result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
>                                               base, len, hash, HT_ALLOC));
> -
> -  /* Rarely, identifiers require diagnostics when lexed.  */
> -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> -                       && !pfile->state.skipping, 0))
> -    {
> -      /* It is allowed to poison the same identifier twice.  */
> -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> -       cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> -                  NODE_NAME (result));
> -
> -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> -        replacement list of a variadic macro.  */
> -      if (result == pfile->spec_nodes.n__VA_ARGS__
> -         && !pfile->state.va_args_ok)
> -       {
> -         if (CPP_OPTION (pfile, cplusplus))
> -           cpp_error (pfile, CPP_DL_PEDWARN,
> -                      "__VA_ARGS__ can only appear in the expansion"
> -                      " of a C++11 variadic macro");
> -         else
> -           cpp_error (pfile, CPP_DL_PEDWARN,
> -                      "__VA_ARGS__ can only appear in the expansion"
> -                      " of a C99 variadic macro");
> -       }
> -
> -      if (result == pfile->spec_nodes.n__VA_OPT__)
> -       maybe_va_opt_error (pfile);
> -
> -      /* For -Wc++-compat, warn about use of C++ named operators.  */
> -      if (result->flags & NODE_WARN_OPERATOR)
> -       cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> -                    "identifier \"%s\" is a special operator name in C++",
> -                    NODE_NAME (result));
> -    }
> -
> +  identifier_diagnostics_on_lex (pfile, result);
>    return result;
>  }
>
> @@ -2221,7 +2231,9 @@ _cpp_lex_identifier (cpp_reader *pfile, const char *name)
>    return result;
>  }
>
> -/* Lex an identifier starting at BUFFER->CUR - 1.  */
> +/* Lex an identifier starting at BASE.  BUFFER->CUR is expected to point
> +   one past the first character at BASE, which may be a (possibly multi-byte)
> +   character if STARTS_UCN is true.  */
>  static cpp_hashnode *
>  lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
>                 struct normalize_state *nst, cpp_hashnode **spelling)
> @@ -2270,42 +2282,51 @@ lex_identifier (cpp_reader *pfile, const uchar *base, bool starts_ucn,
>        *spelling = result;
>      }
>
> -  /* Rarely, identifiers require diagnostics when lexed.  */
> -  if (__builtin_expect ((result->flags & NODE_DIAGNOSTIC)
> -                       && !pfile->state.skipping, 0))
> -    {
> -      /* It is allowed to poison the same identifier twice.  */
> -      if ((result->flags & NODE_POISONED) && !pfile->state.poisoned_ok)
> -       cpp_error (pfile, CPP_DL_ERROR, "attempt to use poisoned \"%s\"",
> -                  NODE_NAME (result));
> -
> -      /* Constraint 6.10.3.5: __VA_ARGS__ should only appear in the
> -        replacement list of a variadic macro.  */
> -      if (result == pfile->spec_nodes.n__VA_ARGS__
> -         && !pfile->state.va_args_ok)
> -       {
> -         if (CPP_OPTION (pfile, cplusplus))
> -           cpp_error (pfile, CPP_DL_PEDWARN,
> -                      "__VA_ARGS__ can only appear in the expansion"
> -                      " of a C++11 variadic macro");
> -         else
> -           cpp_error (pfile, CPP_DL_PEDWARN,
> -                      "__VA_ARGS__ can only appear in the expansion"
> -                      " of a C99 variadic macro");
> -       }
> +  return result;
> +}
>
> -      /* __VA_OPT__ should only appear in the replacement list of a
> -        variadic macro.  */
> -      if (result == pfile->spec_nodes.n__VA_OPT__)
> -       maybe_va_opt_error (pfile);
> -
> -      /* For -Wc++-compat, warn about use of C++ named operators.  */
> -      if (result->flags & NODE_WARN_OPERATOR)
> -       cpp_warning (pfile, CPP_W_CXX_OPERATOR_NAMES,
> -                    "identifier \"%s\" is a special operator name in C++",
> -                    NODE_NAME (result));
> -    }
> +/* Struct to hold the return value of the scan_cur_identifier () helper
> +   function below.  */
>
> +struct scan_id_result
> +{
> +  cpp_hashnode *node;
> +  normalize_state nst;
> +
> +  scan_id_result ()
> +    : node (nullptr)
> +  {
> +    nst = INITIAL_NORMALIZE_STATE;
> +  }
> +
> +  explicit operator bool () const { return node; }
> +};
> +
> +/* Helper function to scan an entire identifier beginning at
> +   pfile->buffer->cur, and possibly containing extended characters (UCNs
> +   and/or UTF-8).  Returns the cpp_hashnode for the identifier on success, or
> +   else nullptr, as well as a normalize_state so that normalization warnings
> +   may be issued once the token lexing is complete.  */
> +
> +static scan_id_result
> +scan_cur_identifier (cpp_reader *pfile)
> +{
> +  const auto buffer = pfile->buffer;
> +  const auto begin = buffer->cur;
> +  scan_id_result result;
> +  if (ISIDST (*buffer->cur))
> +    {
> +      ++buffer->cur;
> +      cpp_hashnode *ignore;
> +      result.node = lex_identifier (pfile, begin, false, &result.nst, &ignore);
> +    }
> +  else if (forms_identifier_p (pfile, true, &result.nst))
> +    {
> +      /* buffer->cur has been moved already by the call
> +        to forms_identifier_p.  */
> +      cpp_hashnode *ignore;
> +      result.node = lex_identifier (pfile, begin, true, &result.nst, &ignore);
> +    }
>    return result;
>  }
>
> @@ -2365,6 +2386,24 @@ create_literal (cpp_reader *pfile, cpp_token *token, const uchar *base,
>    token->val.str.text = cpp_alloc_token_string (pfile, base, len);
>  }
>
> +/* Like create_literal(), but construct it from two separate strings
> +   which are concatenated.  LEN2 may be 0 if no second string is
> +   required.  */
> +static void
> +create_literal2 (cpp_reader *pfile, cpp_token *token, const uchar *base1,
> +                unsigned int len1, const uchar *base2, unsigned int len2,
> +                enum cpp_ttype type)
> +{
> +  token->type = type;
> +  token->val.str.len = len1 + len2;
> +  uchar *const dest = _cpp_unaligned_alloc (pfile, len1 + len2 + 1);
> +  memcpy (dest, base1, len1);
> +  if (len2)
> +    memcpy (dest+len1, base2, len2);
> +  dest[len1 + len2] = 0;
> +  token->val.str.text = dest;
> +}
> +
>  const uchar *
>  cpp_alloc_token_string (cpp_reader *pfile,
>                         const unsigned char *ptr, unsigned len)
> @@ -2403,6 +2442,11 @@ struct lit_accum {
>        rpos = NULL;
>      return c;
>    }
> +
> +  void create_literal2 (cpp_reader *pfile, cpp_token *token,
> +                       const uchar *base1, unsigned int len1,
> +                       const uchar *base2, unsigned int len2,
> +                       enum cpp_ttype type);
>  };
>
>  /* Subroutine of lex_raw_string: Append LEN chars from BASE to the buffer
> @@ -2445,45 +2489,57 @@ lit_accum::read_begin (cpp_reader *pfile)
>    rpos = BUFF_FRONT (last);
>  }
>
> -/* Returns true if a macro has been defined.
> -   This might not work if compile with -save-temps,
> -   or preprocess separately from compilation.  */
> +/* Helper function to check if a string format macro, say from inttypes.h, is
> +   placed touching a string literal, in which case it could be parsed as a C++11
> +   user-defined string literal thus breaking the program.  User-defined literals
> +   outside of namespace std must start with a single underscore, so assume
> +   anything of that form really is a UDL suffix.  We don't need to worry about
> +   UDLs defined inside namespace std because their names are reserved, so cannot
> +   be used as macro names in valid programs.  Return TRUE if the UDL should be
> +   ignored for now and preserved for potential macro expansion.  */
>
>  static bool
> -is_macro(cpp_reader *pfile, const uchar *base)
> +maybe_ignore_udl_macro_suffix (cpp_reader *pfile, location_t src_loc,
> +                              const uchar *suffix_begin, cpp_hashnode *node)
>  {
> -  const uchar *cur = base;
> -  if (! ISIDST (*cur))
> +  if ((suffix_begin[0] == '_' && suffix_begin[1] != '_')
> +      || !cpp_macro_p (node))
>      return false;
> -  unsigned int hash = HT_HASHSTEP (0, *cur);
> -  ++cur;
> -  while (ISIDNUM (*cur))
> -    {
> -      hash = HT_HASHSTEP (hash, *cur);
> -      ++cur;
> -    }
> -  hash = HT_HASHFINISH (hash, cur - base);
>
> -  cpp_hashnode *result = CPP_HASHNODE (ht_lookup_with_hash (pfile->hash_table,
> -                                       base, cur - base, hash, HT_NO_INSERT));
> -
> -  return result && cpp_macro_p (result);
> +  /* Maybe raise a warning here; caller should arrange not to consume
> +     the tokens.  */
> +  if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> +    cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX, src_loc, 0,
> +                          "invalid suffix on literal; C++11 requires a space "
> +                          "between literal and string macro");
> +  return true;
>  }
>
> -/* Returns true if a literal suffix does not have the expected form
> -   and is defined as a macro.  */
> -
> -static bool
> -is_macro_not_literal_suffix(cpp_reader *pfile, const uchar *base)
> +/* Like create_literal2(), but also prepend all the accumulated data from
> +   the lit_accum struct.  */
> +void
> +lit_accum::create_literal2 (cpp_reader *pfile, cpp_token *token,
> +                           const uchar *base1, unsigned int len1,
> +                           const uchar *base2, unsigned int len2,
> +                           enum cpp_ttype type)
>  {
> -  /* User-defined literals outside of namespace std must start with a single
> -     underscore, so assume anything of that form really is a UDL suffix.
> -     We don't need to worry about UDLs defined inside namespace std because
> -     their names are reserved, so cannot be used as macro names in valid
> -     programs.  */
> -  if (base[0] == '_' && base[1] != '_')
> -    return false;
> -  return is_macro (pfile, base);
> +  const unsigned int tot_len = accum + len1 + len2;
> +  uchar *dest = _cpp_unaligned_alloc (pfile, tot_len + 1);
> +  token->type = type;
> +  token->val.str.len = tot_len;
> +  token->val.str.text = dest;
> +  for (_cpp_buff *buf = first; buf; buf = buf->next)
> +    {
> +      size_t len = BUFF_FRONT (buf) - buf->base;
> +      memcpy (dest, buf->base, len);
> +      dest += len;
> +    }
> +  memcpy (dest, base1, len1);
> +  dest += len1;
> +  if (len2)
> +    memcpy (dest, base2, len2);
> +  dest += len2;
> +  *dest = '\0';
>  }
>
>  /* Lexes a raw string.  The stored string contains the spelling,
> @@ -2758,26 +2814,25 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>
>    if (CPP_OPTION (pfile, user_literals))
>      {
> -      /* If a string format macro, say from inttypes.h, is placed touching
> -        a string literal it could be parsed as a C++11 user-defined string
> -        literal thus breaking the program.  */
> -      if (is_macro_not_literal_suffix (pfile, pos))
> -       {
> -         /* Raise a warning, but do not consume subsequent tokens.  */
> -         if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> -           cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> -                                  token->src_loc, 0,
> -                                  "invalid suffix on literal; C++11 requires "
> -                                  "a space between literal and string macro");
> -       }
> -      /* Grab user defined literal suffix.  */
> -      else if (ISIDST (*pos))
> -       {
> -         type = cpp_userdef_string_add_type (type);
> -         ++pos;
> +      const uchar *const suffix_begin = pos;
> +      pfile->buffer->cur = pos;
>
> -         while (ISIDNUM (*pos))
> -           ++pos;
> +      if (const auto sr = scan_cur_identifier (pfile))
> +       {
> +         if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> +                                            suffix_begin, sr.node))
> +             pfile->buffer->cur = suffix_begin;
> +         else
> +           {
> +             type = cpp_userdef_string_add_type (type);
> +             accum.create_literal2 (pfile, token, base, suffix_begin - base,
> +                                    NODE_NAME (sr.node), NODE_LEN (sr.node),
> +                                    type);
> +             if (accum.first)
> +               _cpp_release_buff (pfile, accum.first);
> +             warn_about_normalization (pfile, token, &sr.nst, true);
> +             return;
> +           }
>         }
>      }
>
> @@ -2787,21 +2842,8 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>      create_literal (pfile, token, base, pos - base, type);
>    else
>      {
> -      size_t extra_len = pos - base;
> -      uchar *dest = _cpp_unaligned_alloc (pfile, accum.accum + extra_len + 1);
> -
> -      token->type = type;
> -      token->val.str.len = accum.accum + extra_len;
> -      token->val.str.text = dest;
> -      for (_cpp_buff *buf = accum.first; buf; buf = buf->next)
> -       {
> -         size_t len = BUFF_FRONT (buf) - buf->base;
> -         memcpy (dest, buf->base, len);
> -         dest += len;
> -       }
> +      accum.create_literal2 (pfile, token, base, pos - base, nullptr, 0, type);
>        _cpp_release_buff (pfile, accum.first);
> -      memcpy (dest, base, extra_len);
> -      dest[extra_len] = '\0';
>      }
>  }
>
> @@ -2908,39 +2950,40 @@ lex_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
>      cpp_error (pfile, CPP_DL_PEDWARN, "missing terminating %c character",
>                (int) terminator);
>
> +  pfile->buffer->cur = cur;
> +  const uchar *const suffix_begin = cur;
> +
>    if (CPP_OPTION (pfile, user_literals))
>      {
> -      /* If a string format macro, say from inttypes.h, is placed touching
> -        a string literal it could be parsed as a C++11 user-defined string
> -        literal thus breaking the program.  */
> -      if (is_macro_not_literal_suffix (pfile, cur))
> -       {
> -         /* Raise a warning, but do not consume subsequent tokens.  */
> -         if (CPP_OPTION (pfile, warn_literal_suffix) && !pfile->state.skipping)
> -           cpp_warning_with_line (pfile, CPP_W_LITERAL_SUFFIX,
> -                                  token->src_loc, 0,
> -                                  "invalid suffix on literal; C++11 requires "
> -                                  "a space between literal and string macro");
> -       }
> -      /* Grab user defined literal suffix.  */
> -      else if (ISIDST (*cur))
> +      if (const auto sr = scan_cur_identifier (pfile))
>         {
> -         type = cpp_userdef_char_add_type (type);
> -         type = cpp_userdef_string_add_type (type);
> -          ++cur;
> -
> -         while (ISIDNUM (*cur))
> -           ++cur;
> +         if (maybe_ignore_udl_macro_suffix (pfile, token->src_loc,
> +                                            suffix_begin, sr.node))
> +           pfile->buffer->cur = suffix_begin;
> +         else
> +           {
> +             /* Grab user defined literal suffix.  */
> +             type = cpp_userdef_char_add_type (type);
> +             type = cpp_userdef_string_add_type (type);
> +             create_literal2 (pfile, token, base, suffix_begin - base,
> +                              NODE_NAME (sr.node), NODE_LEN (sr.node), type);
> +             warn_about_normalization (pfile, token, &sr.nst, true);
> +             return;
> +           }
>         }
>      }
>    else if (CPP_OPTION (pfile, cpp_warn_cxx11_compat)
> -          && is_macro (pfile, cur)
>            && !pfile->state.skipping)
> -    cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> -                          token->src_loc, 0, "C++11 requires a space "
> -                          "between string literal and macro");
> +    {
> +      const auto sr = scan_cur_identifier (pfile);
> +      /* Maybe raise a warning, but do not consume the tokens.  */
> +      pfile->buffer->cur = suffix_begin;
> +      if (sr && cpp_macro_p (sr.node))
> +       cpp_warning_with_line (pfile, CPP_W_CXX11_COMPAT,
> +                              token->src_loc, 0, "C++11 requires a space "
> +                              "between string literal and macro");
> +    }
>
> -  pfile->buffer->cur = cur;
>    create_literal (pfile, token, base, cur - base, type);
>  }
>
> @@ -3915,9 +3958,10 @@ _cpp_lex_direct (cpp_reader *pfile)
>        result->type = CPP_NAME;
>        {
>         struct normalize_state nst = INITIAL_NORMALIZE_STATE;
> -       result->val.node.node = lex_identifier (pfile, buffer->cur - 1, false,
> -                                               &nst,
> -                                               &result->val.node.spelling);
> +       const auto node = lex_identifier (pfile, buffer->cur - 1, false, &nst,
> +                                         &result->val.node.spelling);
> +       result->val.node.node = node;
> +       identifier_diagnostics_on_lex (pfile, node);
>         warn_about_normalization (pfile, result, &nst, true);
>        }
>
> @@ -4220,8 +4264,10 @@ _cpp_lex_direct (cpp_reader *pfile)
>         if (forms_identifier_p (pfile, true, &nst))
>           {
>             result->type = CPP_NAME;
> -           result->val.node.node = lex_identifier (pfile, base, true, &nst,
> -                                                   &result->val.node.spelling);
> +           const auto node = lex_identifier (pfile, base, true, &nst,
> +                                             &result->val.node.spelling);
> +           result->val.node.node = node;
> +           identifier_diagnostics_on_lex (pfile, node);
>             warn_about_normalization (pfile, result, &nst, true);
>             break;
>           }
> @@ -4353,7 +4399,7 @@ cpp_digraph2name (enum cpp_ttype type)
>  }
>
>  /* Write the spelling of an identifier IDENT, using UCNs, to BUFFER.
> -   The buffer must already contain the enough space to hold the
> +   The buffer must already contain enough space to hold the
>     token's spelling.  Returns a pointer to the character after the
>     last character written.  */
>  unsigned char *
> @@ -4375,7 +4421,7 @@ _cpp_spell_ident_ucns (unsigned char *buffer, cpp_hashnode *ident)
>  }
>
>  /* Write the spelling of a token TOKEN to BUFFER.  The buffer must
> -   already contain the enough space to hold the token's spelling.
> +   already contain enough space to hold the token's spelling.
>     Returns a pointer to the character after the last character written.
>     FORSTRING is true if this is to be the spelling after translation
>     phase 1 (with the original spelling of extended identifiers), false
