[gcc r15-2322] c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]

Jakub Jelinek via Gcc-cvs Thu, 25 Jul 2024 12:38:33 -0700

https://gcc.gnu.org/g:29341f21ce1eb7cdb8cd468e4ceb0d07cf2775e0


commit r15-2322-g29341f21ce1eb7cdb8cd468e4ceb0d07cf2775e0
Author: Jakub Jelinek <ja...@redhat.com>
Date:   Thu Jul 25 21:36:31 2024 +0200

    c++: Implement C++26 P2558R2 - Add @, $, and ` to the basic character set [PR110343]
    
    The following patch implements the easy parts of the paper.
    When @$` are added to the basic character set, it means that
    R"@$`()@$`" should now be valid (while at it, I've noticed that most of
    the raw string tests were run solely with -std=c++11 or -std=gnu++11
    and I've tried to change that).  On the other hand, even if $ is
    allowed in identifiers as an extension, \u0024, \U00000024 and \u{24}
    should not be, similarly to how \u0041 is not allowed.
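    As a rough illustration of the delimiter rule, the characters usable in
    a raw string delimiter before and after P2558R2 could be modelled as
    below.  This is only a sketch, not the libcpp implementation: the names
    is_raw_delim_char and is_valid_raw_delimiter are hypothetical, and an
    ASCII execution environment and C++17 are assumed.

```cpp
#include <string_view>

// Hypothetical helper, not part of libcpp: true if C may appear in a raw
// string delimiter.  CXX26 selects the P2558R2 rules.  Assumes ASCII.
constexpr bool is_raw_delim_char (char c, bool cxx26)
{
  if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
      || (c >= '0' && c <= '9'))
    return true;
  // Graphic members of the pre-C++26 basic character set, minus the
  // parentheses and backslash, which a delimiter can never contain.
  constexpr std::string_view punct = "_{}[]#<>%:;.?*+-/^&|~!=,\"'";
  if (punct.find (c) != std::string_view::npos)
    return true;
  // The three characters added by P2558R2.
  return cxx26 && (c == '$' || c == '@' || c == '`');
}

// Hypothetical helper: a delimiter is at most 16 such characters.
constexpr bool is_valid_raw_delimiter (std::string_view delim, bool cxx26)
{
  if (delim.size () > 16)
    return false;
  for (char c : delim)
    if (!is_raw_delim_char (c, cxx26))
      return false;
  return true;
}

static_assert (is_valid_raw_delimiter ("@$`", true));
static_assert (!is_valid_raw_delimiter ("@$`", false));
```

    With these assumptions the delimiter of R"@$`()@$`" is accepted only
    when the C++26 rules are selected, and a 17 character delimiter is
    rejected in either mode.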
    
    The paper in 3.1 claims though that
     #include <stdio.h>
    
     #define STR(x) #x
    
    int main()
    {
      printf("%s", STR(\u0060)); // U+0060 is ` GRAVE ACCENT
    }
    should have been accepted before this paper (and rejected after it),
    but g++ rejects it.
    
    I've tried to understand it, but am confused on what is the right
    behavior and why.
    
    Consider
     #define STR(x) #x
    const char *a = "\u00b7";
    const char *b = STR(\u00b7);
    const char *c = "\u0041";
    const char *d = STR(\u0041);
    const char *e = STR(a\u00b7);
    const char *f = STR(a\u0041);
    const char *g = STR(a \u00b7);
    const char *h = STR(a \u0041);
    const char *i = "\u066d";
    const char *j = STR(\u066d);
    const char *k = "\u0040";
    const char *l = STR(\u0040);
    const char *m = STR(a\u066d);
    const char *n = STR(a\u0040);
    const char *o = STR(a \u066d);
    const char *p = STR(a \u0040);
    
    Neither clang nor gcc emits any diagnostics on the a, c, i and k
    initializers, those are certainly valid (c is invalid in C23 though).  g++
    emits errors with -pedantic-errors on all the others, while clang++ does so
    on the ones with STR involving \u0041, \u0040 and a\u066d.  The chosen
    values are \u0040 '@' as something being changed by this paper, \u0041 'A'
    as a basic character set char valid in identifiers before/after, \u00b7 as
    an example of a character which is pedantically valid in identifiers if not
    at the start and \u066d as something pedantically not valid in identifiers.
    
    Now, https://eel.is/c++draft/lex.charset#6 says that a UCN used outside of
    a string/character literal which corresponds to a basic character set
    character (or a control character) is ill-formed; that would make the d, f,
    h cases invalid for C++ and the l, n, p cases invalid for C++26.
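    The restriction above can be sketched as a small predicate.  It is only
    a model of the lex.charset#6 rule under the assumption of Unicode code
    points, not libcpp code; the name ucn_ill_formed_outside_literal is
    hypothetical, and the later question of how the character is tokenized
    is deliberately left out.

```cpp
// Hypothetical model of lex.charset#6, not libcpp code: true if a UCN
// naming code point CP is ill-formed outside character/string literals.
constexpr bool ucn_ill_formed_outside_literal (char32_t cp, bool cxx26)
{
  // Control characters: U+0000..U+001F and U+007F..U+009F.
  if (cp <= 0x1F || (cp >= 0x7F && cp <= 0x9F))
    return true;
  if (cp > 0x7E)
    return false;
  // Within U+0020..U+007E everything except $ @ ` was already in the
  // basic character set; P2558R2 adds those three for C++26.
  if (cp == 0x24 || cp == 0x40 || cp == 0x60)
    return cxx26;
  return true;
}

static_assert (ucn_ill_formed_outside_literal (0x41, false));   // d, f, h
static_assert (!ucn_ill_formed_outside_literal (0x40, false));  // l, n, p before C++26
static_assert (ucn_ill_formed_outside_literal (0x40, true));    // l, n, p in C++26
static_assert (!ucn_ill_formed_outside_literal (0xB7, true));   // b, e, g
static_assert (!ucn_ill_formed_outside_literal (0x66D, true));  // j, m, o
```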
    
    https://eel.is/c++draft/lex.name states which characters can appear at the
    start of the identifier and which can appear after the start.  And
    https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
    either identifier, or tons of other things, or "each non-whitespace
    character that cannot be one of the above"
    
    Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
    invalid if the preprocessing token is being converted into token.
    
    And https://eel.is/c++draft/lex.pptoken#2 includes "If any character not in
    the basic character set matches the last category, the program is
    ill-formed."
    
    Now, e.g.  for the C++23 STR(\u0040) case, \u0040 is not in the basic
    character set there, so it is valid outside of the literals (not the case
    anymore in C++26), but it isn't a nondigit and doesn't have the XID_Start
    property, so IMHO it isn't an identifier and so must be the "each
    non-whitespace character that cannot be one of the above" case.  Why doesn't
    the above mentioned https://eel.is/c++draft/lex.pptoken#2 sentence make that
    invalid?  Ignoring that, I'd say it would then be stringized and that feels
    like what clang++ is doing.  Now, e.g.  for the STR(a\u066d) case, I wonder
    why that isn't lexed as an identifier followed by a \u066d "each
    non-whitespace character that cannot be one of the above" token and
    stringized similarly; clang++ rejects that.
    
    What GCC libcpp seems to be doing is that forms_identifier_p calls
    _cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether the
    character is the first or a later one in the identifier, and e.g.
    _cpp_valid_ucn then for UCNs valid in string literals does
      else if (identifier_pos)
        {
          int validity = ucn_valid_in_identifier (pfile, result, nst);
    
          if (validity == 0)
            cpp_error (pfile, CPP_DL_ERROR,
                       "universal character %.*s is not valid in an identifier",
                       (int) (str - base), base);
          else if (validity == 2 && identifier_pos == 1)
            cpp_error (pfile, CPP_DL_ERROR,
       "universal character %.*s is not valid at the start of an identifier",
                       (int) (str - base), base);
        }
    so basically all those invalid-in-identifier cases emit an error and are
    then treated as if they were valid in identifiers, rather than doing what
    e.g.  _cpp_valid_utf8 does for C but not for C++, and only for the chars
    completely invalid in identifiers rather than those valid in identifiers
    but not at the start:
              /* In C++, this is an error for invalid character in an identifier
                 because logically, the UTF-8 was converted to a UCN during
                 translation phase 1 (even though we don't physically do it that
                 way).  In C, this byte rather becomes grammatically a separate
                 token.  */
    
              if (CPP_OPTION (pfile, cplusplus))
                cpp_error (pfile, CPP_DL_ERROR,
                           "extended character %.*s is not valid in an identifier",
                           (int) (*pstr - base), base);
              else
                {
                  *pstr = base;
                  return false;
                }
    The comment doesn't really match what is done in recent C++ versions because
    there UCNs are translated to characters and not the other way around.
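    The decision made by the first quoted snippet can be modelled in
    isolation as follows.  This is a sketch with hypothetical names (Diag,
    diagnose_ucn), not libcpp code; the values 0 and 2 follow the quoted
    condition, while treating 1 as "valid anywhere" and any identifier_pos
    other than 1 as "not at the start" are assumptions.

```cpp
// Hypothetical model, not libcpp code.  VALIDITY follows the quoted
// snippet: 0 = not valid in an identifier, 2 = valid but not at the
// start; 1 is assumed here to mean valid anywhere.
enum class Diag { None, NotValidInIdentifier, NotValidAtStart };

// IDENTIFIER_POS is 1 for the first character of an identifier; any
// other nonzero value is assumed to mean a later position.
constexpr Diag diagnose_ucn (int validity, int identifier_pos)
{
  if (validity == 0)
    return Diag::NotValidInIdentifier;
  if (validity == 2 && identifier_pos == 1)
    return Diag::NotValidAtStart;
  return Diag::None;
}

static_assert (diagnose_ucn (0, 2) == Diag::NotValidInIdentifier);
static_assert (diagnose_ucn (2, 1) == Diag::NotValidAtStart);
static_assert (diagnose_ucn (2, 2) == Diag::None);
```

    Note that the model only picks the diagnostic; as described above, the
    character is still consumed as part of the identifier after the error.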
    
    2024-07-25  Jakub Jelinek  <ja...@redhat.com>
    
            PR c++/110343
    libcpp/
            * lex.cc: C++26 P2558R2 - Add @, $, and ` to the basic character
            set.
            (lex_raw_string): For C++26 allow $@` characters in prefix.
            * charset.cc (_cpp_valid_ucn): For C++26 reject \u0024 in
            identifiers.
    gcc/testsuite/
            * c-c++-common/raw-string-1.c: Use { c || c++11 } effective target,
            remove c++ specific dg-options.
            * c-c++-common/raw-string-2.c: Likewise.
            * c-c++-common/raw-string-4.c: Likewise.
            * c-c++-common/raw-string-5.c: Likewise.  Expect some diagnostics
            only for non-c++26, for c++26 expect different.
            * c-c++-common/raw-string-6.c: Use { c || c++11 } effective target,
            remove c++ specific dg-options.
            * c-c++-common/raw-string-11.c: Likewise.
            * c-c++-common/raw-string-13.c: Likewise.
            * c-c++-common/raw-string-14.c: Likewise.
            * c-c++-common/raw-string-15.c: Use { c || c++11 } effective target,
            change c++ specific dg-options to just -Wtrigraphs.
            * c-c++-common/raw-string-16.c: Likewise.
            * c-c++-common/raw-string-17.c: Use { c || c++11 } effective target,
            remove c++ specific dg-options.
            * c-c++-common/raw-string-18.c: Use { c || c++11 } effective target,
            remove -std=c++11 from c++ specific dg-options.
            * c-c++-common/raw-string-19.c: Likewise.
            * g++.dg/cpp26/raw-string1.C: New test.
            * g++.dg/cpp26/raw-string2.C: New test.

Diff:
---
 gcc/testsuite/c-c++-common/raw-string-1.c  |  3 +--
 gcc/testsuite/c-c++-common/raw-string-11.c |  5 ++---
 gcc/testsuite/c-c++-common/raw-string-13.c |  3 +--
 gcc/testsuite/c-c++-common/raw-string-14.c |  3 +--
 gcc/testsuite/c-c++-common/raw-string-15.c |  4 ++--
 gcc/testsuite/c-c++-common/raw-string-16.c |  4 ++--
 gcc/testsuite/c-c++-common/raw-string-17.c |  3 +--
 gcc/testsuite/c-c++-common/raw-string-18.c |  4 ++--
 gcc/testsuite/c-c++-common/raw-string-19.c |  4 ++--
 gcc/testsuite/c-c++-common/raw-string-2.c  |  3 +--
 gcc/testsuite/c-c++-common/raw-string-4.c  |  3 +--
 gcc/testsuite/c-c++-common/raw-string-5.c  | 19 ++++++++++++-------
 gcc/testsuite/c-c++-common/raw-string-6.c  |  3 +--
 gcc/testsuite/g++.dg/cpp26/raw-string1.C   |  4 ++++
 gcc/testsuite/g++.dg/cpp26/raw-string2.C   |  7 +++++++
 libcpp/charset.cc                          |  7 ++++++-
 libcpp/lex.cc                              |  5 ++++-
 17 files changed, 50 insertions(+), 34 deletions(-)

diff --git a/gcc/testsuite/c-c++-common/raw-string-1.c b/gcc/testsuite/c-c++-common/raw-string-1.c
index 199a3c6c83f9..321b5afeaff2 100644
--- a/gcc/testsuite/c-c++-common/raw-string-1.c
+++ b/gcc/testsuite/c-c++-common/raw-string-1.c
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
diff --git a/gcc/testsuite/c-c++-common/raw-string-11.c b/gcc/testsuite/c-c++-common/raw-string-11.c
index 19210c57452b..daa75f316abd 100644
--- a/gcc/testsuite/c-c++-common/raw-string-11.c
+++ b/gcc/testsuite/c-c++-common/raw-string-11.c
@@ -1,7 +1,7 @@
 // PR preprocessor/48740
+// { dg-do run { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs -save-temps" { target c } }
-// { dg-options "-std=c++0x -save-temps" { target c++ } }
-// { dg-do run }
+// { dg-options "-save-temps" { target c++ } }
 
 int main ()
 {
@@ -9,4 +9,3 @@ int main ()
                           "foo%sbar%sfred%sbob?""?""?""?""?",
                           sizeof ("foo%sbar%sfred%sbob?""?""?""?""?"));
 }
-
diff --git a/gcc/testsuite/c-c++-common/raw-string-13.c b/gcc/testsuite/c-c++-common/raw-string-13.c
index fa11edaa7aab..5ab9a4539558 100644
--- a/gcc/testsuite/c-c++-common/raw-string-13.c
+++ b/gcc/testsuite/c-c++-common/raw-string-13.c
@@ -1,8 +1,7 @@
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
diff --git a/gcc/testsuite/c-c++-common/raw-string-14.c b/gcc/testsuite/c-c++-common/raw-string-14.c
index fba826c4c513..81f0fe9e1a55 100644
--- a/gcc/testsuite/c-c++-common/raw-string-14.c
+++ b/gcc/testsuite/c-c++-common/raw-string-14.c
@@ -1,7 +1,6 @@
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -trigraphs" { target c } }
-// { dg-options "-std=c++11" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
diff --git a/gcc/testsuite/c-c++-common/raw-string-15.c b/gcc/testsuite/c-c++-common/raw-string-15.c
index 1d101dc83935..cc9d393d07d3 100644
--- a/gcc/testsuite/c-c++-common/raw-string-15.c
+++ b/gcc/testsuite/c-c++-common/raw-string-15.c
@@ -1,8 +1,8 @@
 // PR preprocessor/57620
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
diff --git a/gcc/testsuite/c-c++-common/raw-string-16.c b/gcc/testsuite/c-c++-common/raw-string-16.c
index 1bf16dd5a1ed..3ddbd8fb2ed3 100644
--- a/gcc/testsuite/c-c++-common/raw-string-16.c
+++ b/gcc/testsuite/c-c++-common/raw-string-16.c
@@ -1,7 +1,7 @@
 // PR preprocessor/57620
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99 -Wtrigraphs" { target c } }
-// { dg-options "-std=gnu++11 -Wtrigraphs" { target c++ } }
+// { dg-options "-Wtrigraphs" { target c++ } }
 
 const void *s0 = R"abc\
 def()abcdef" 0;
diff --git a/gcc/testsuite/c-c++-common/raw-string-17.c b/gcc/testsuite/c-c++-common/raw-string-17.c
index 30df020082e4..48db8ca375fa 100644
--- a/gcc/testsuite/c-c++-common/raw-string-17.c
+++ b/gcc/testsuite/c-c++-common/raw-string-17.c
@@ -1,7 +1,6 @@
 /* PR preprocessor/57824 */
-/* { dg-do run } */
+/* { dg-do run { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99" { target c } } */
-/* { dg-options "-std=c++11" { target c++ } } */
 
 #define S(s) s
 #define T(s) s "\n"
diff --git a/gcc/testsuite/c-c++-common/raw-string-18.c b/gcc/testsuite/c-c++-common/raw-string-18.c
index 6709946e0c56..d96639b80742 100644
--- a/gcc/testsuite/c-c++-common/raw-string-18.c
+++ b/gcc/testsuite/c-c++-common/raw-string-18.c
@@ -1,7 +1,7 @@
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+/* { dg-do compile { target { c || c++11 } } } */
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno" { target c++ } } */
 
 const char x[] = R"(
 abc
diff --git a/gcc/testsuite/c-c++-common/raw-string-19.c b/gcc/testsuite/c-c++-common/raw-string-19.c
index 7ab9e6cbea6f..88c542084998 100644
--- a/gcc/testsuite/c-c++-common/raw-string-19.c
+++ b/gcc/testsuite/c-c++-common/raw-string-19.c
@@ -1,7 +1,7 @@
 /* PR preprocessor/57824 */
-/* { dg-do compile } */
+// { dg-do compile { target { c || c++11 } } }
 /* { dg-options "-std=gnu99 -fdump-tree-optimized-lineno -save-temps" { target c } } */
-/* { dg-options "-std=c++11 -fdump-tree-optimized-lineno -save-temps" { target c++ } } */
+/* { dg-options "-fdump-tree-optimized-lineno -save-temps" { target c++ } } */
 
 const char x[] = R"(
 abc
diff --git a/gcc/testsuite/c-c++-common/raw-string-2.c b/gcc/testsuite/c-c++-common/raw-string-2.c
index 6f2e37d47cab..9601c1de94f2 100644
--- a/gcc/testsuite/c-c++-common/raw-string-2.c
+++ b/gcc/testsuite/c-c++-common/raw-string-2.c
@@ -1,7 +1,6 @@
-// { dg-do run }
+// { dg-do run { target { c || c++11 } } }
 // { dg-require-effective-target wchar }
 // { dg-options "-std=gnu99 -Wno-c++-compat -trigraphs" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 #ifndef __cplusplus
 #include <wchar.h>
diff --git a/gcc/testsuite/c-c++-common/raw-string-4.c b/gcc/testsuite/c-c++-common/raw-string-4.c
index 303233bb344e..4870ac4caa02 100644
--- a/gcc/testsuite/c-c++-common/raw-string-4.c
+++ b/gcc/testsuite/c-c++-common/raw-string-4.c
@@ -1,7 +1,6 @@
 // R is not applicable for character literals.
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const int      i0      = R'a'; // { dg-error "was not declared|undeclared" "undeclared" }
                // { dg-error "expected ',' or ';'" "expected" { target c } .-1 }
diff --git a/gcc/testsuite/c-c++-common/raw-string-5.c b/gcc/testsuite/c-c++-common/raw-string-5.c
index dbf31333213d..1bb4a3072e81 100644
--- a/gcc/testsuite/c-c++-common/raw-string-5.c
+++ b/gcc/testsuite/c-c++-common/raw-string-5.c
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"0123456789abcdefg()0123456789abcdefg" 0;
        // { dg-error "raw string delimiter longer" "longer" { target *-*-* } .-1 }
@@ -15,12 +14,18 @@ const void *s3 = R")())" 0;
        // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
        // { dg-error "stray" "stray" { target *-*-* } .-2 }
 const void *s4 = R"@()@" 0;
-       // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-       // { dg-error "stray" "stray" { target *-*-* } .-2 }
+       // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+       // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+       // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
 const void *s5 = R"$()$" 0;
-       // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
-       // { dg-error "stray" "stray" { target *-*-* } .-2 }
-const void *s6 = R"\u0040()\u0040" 0;
+       // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+       // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+       // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s6 = R"`()`" 0;
+       // { dg-error "invalid character" "invalid" { target { c || c++23_down } } .-1 }
+       // { dg-error "stray" "stray" { target { c || c++23_down } } .-2 }
+       // { dg-error "before numeric constant" "numeric" { target c++26 } .-3 }
+const void *s7 = R"\u0040()\u0040" 0;
        // { dg-error "invalid character" "invalid" { target *-*-* } .-1 }
        // { dg-error "stray" "stray" { target *-*-* } .-2 }
 
diff --git a/gcc/testsuite/c-c++-common/raw-string-6.c b/gcc/testsuite/c-c++-common/raw-string-6.c
index 819dd44aff46..d8a5ac0e158d 100644
--- a/gcc/testsuite/c-c++-common/raw-string-6.c
+++ b/gcc/testsuite/c-c++-common/raw-string-6.c
@@ -1,6 +1,5 @@
-// { dg-do compile }
+// { dg-do compile { target { c || c++11 } } }
 // { dg-options "-std=gnu99" { target c } }
-// { dg-options "-std=c++0x" { target c++ } }
 
 const void *s0 = R"ouch()ouCh";        // { dg-error "unterminated raw string" "unterminated" }
 // { dg-error "at end of input" "end" { target *-*-* } .-1 }
diff --git a/gcc/testsuite/g++.dg/cpp26/raw-string1.C b/gcc/testsuite/g++.dg/cpp26/raw-string1.C
new file mode 100644
index 000000000000..1040c704ec9f
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp26/raw-string1.C
@@ -0,0 +1,4 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target c++26 } }
+
+const char *s0 = R"`@$$@`@`$()`@$$@`@`$";
diff --git a/gcc/testsuite/g++.dg/cpp26/raw-string2.C b/gcc/testsuite/g++.dg/cpp26/raw-string2.C
new file mode 100644
index 000000000000..a756290f8209
--- /dev/null
+++ b/gcc/testsuite/g++.dg/cpp26/raw-string2.C
@@ -0,0 +1,7 @@
+// C++26 P2558R2 - Add @, $, and ` to the basic character set
+// { dg-do compile { target { ! { avr*-*-* mmix*-*-* *-*-aix* } } } }
+// { dg-options "" }
+
+int a$b;
+int a\u0024c;          // { dg-error "universal character \\\\u0024 is not valid in an identifier" "" { target c++26 } }
+int a\U00000024d;      // { dg-error "universal character \\\\U00000024 is not valid in an identifier" "" { target c++26 } }
diff --git a/libcpp/charset.cc b/libcpp/charset.cc
index 54d7b9e09327..d58319a500a1 100644
--- a/libcpp/charset.cc
+++ b/libcpp/charset.cc
@@ -1808,7 +1808,12 @@ _cpp_valid_ucn (cpp_reader *pfile, const uchar **pstr,
       result = 1;
     }
   else if (identifier_pos && result == 0x24 
-          && CPP_OPTION (pfile, dollars_in_ident))
+          && CPP_OPTION (pfile, dollars_in_ident)
+          /* In C++26 when dollars are allowed in identifiers,
+             we should still reject \u0024 as $ is part of the basic
+             character set.  */
+          && !(CPP_OPTION (pfile, cplusplus)
+               && CPP_OPTION (pfile, lang) > CLK_CXX23))
     {
       if (CPP_OPTION (pfile, warn_dollars) && !pfile->state.skipping)
        {
diff --git a/libcpp/lex.cc b/libcpp/lex.cc
index de752bdc9c87..16f2c23af1e1 100644
--- a/libcpp/lex.cc
+++ b/libcpp/lex.cc
@@ -2718,7 +2718,10 @@ lex_raw_string (cpp_reader *pfile, cpp_token *token, const uchar *base)
                       || c == '*' || c == '+' || c == '-' || c == '/'
                       || c == '^' || c == '&' || c == '|' || c == '~'
                       || c == '!' || c == '=' || c == ','
-                      || c == '"' || c == '\''))
+                      || c == '"' || c == '\''
+                      || ((c == '$' || c == '@' || c == '`')
+                          && CPP_OPTION (pfile, cplusplus)
+                          && CPP_OPTION (pfile, lang) > CLK_CXX23)))
            prefix[prefix_len++] = c;
          else
            {
