[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Corentin Jabot via Phabricator via cfe-commits Fri, 09 Dec 2022 15:31:31 -0800

cor3ntin updated this revision to Diff 481766.
cor3ntin added a comment.

- Rebase
- Add/Improve comments
- Add more trigraph tests
- Fix a crash when a trigraph appears at the end of a named escape sequence
and trigraphs are disabled
- Fix when getAndAdvanceChar is called. Alas, there is no way to write tests
for that, but I did check in a debugger.
- Rename s/Res/Match
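
To illustrate the trigraph fix, here is a minimal standalone sketch modeled on the DecodeTrigraphChar hunk in this diff. It is not the Clang code: DiagSink, trigraphCharForLetter and decodeTrigraph are hypothetical stand-ins, and the message strings are placeholders. The point it tries to show is that the returned character should depend only on whether trigraphs are enabled, while the optional sink only controls whether a diagnostic is recorded.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the lexer's diagnostic machinery; not a Clang type.
struct DiagSink {
  std::vector<std::string> Messages;
};

// Maps the third character of a "??x" sequence to its replacement,
// or returns 0 if "??x" is not one of the nine standard trigraphs.
static char trigraphCharForLetter(char Letter) {
  switch (Letter) {
  case '=':  return '#';
  case ')':  return ']';
  case '(':  return '[';
  case '!':  return '|';
  case '\'': return '^';
  case '>':  return '}';
  case '/':  return '\\';
  case '<':  return '{';
  case '-':  return '~';
  default:   return 0;
  }
}

// Returns the decoded character, or 0 when the sequence must be left alone.
// Sink may be null (no diagnostics wanted); the result never depends on it.
static char decodeTrigraph(char Letter, DiagSink *Sink, bool Trigraphs) {
  char Res = trigraphCharForLetter(Letter);
  if (!Res)
    return 0;
  if (!Trigraphs) {
    // Trigraphs disabled: report if possible, but never convert.
    if (Sink)
      Sink->Messages.push_back("trigraph ignored");
    return 0;
  }
  if (Sink)
    Sink->Messages.push_back("trigraph converted");
  return Res;
}
```

With this shape, decodeTrigraph('>', nullptr, false) yields 0 rather than '}', so a caller scanning for the closing brace of \N{...} does not see a brace that is not there when trigraphs are off.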


There are still some inefficiencies with getAndAdvanceChar, but fixing them
calls for a much bigger refactor than should be in scope for this patch.
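
For readers skimming the diff, the core of the CWG2640 change in LiteralSupport.cpp can be sketched in isolation roughly as follows. This is an illustrative approximation, not the patch itself: NameScan, isVerticalWS and scanNamedEscapeBody are hypothetical names, and the input is assumed to be the text that follows "\N{". The scan no longer restricts the name to alphanumerics, spaces, '_' and '-'; it simply stops at the first '}' or vertical whitespace and leaves name validation to the later lookup.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>

// Hypothetical helper: treats LF and CR as vertical whitespace.
static bool isVerticalWS(char C) { return C == '\n' || C == '\r'; }

// Hypothetical result type mirroring the flags computed in the hunk.
struct NameScan {
  std::size_t End;  // offset where the scan stopped
  bool Incomplete;  // ran off the end without finding a stop character
  bool Empty;       // stopped immediately, so there is no name at all
};

// Body is assumed to be the characters following "\N{".
static NameScan scanNamedEscapeBody(std::string_view Body) {
  auto It = std::find_if(Body.begin(), Body.end(), [](char C) {
    return C == '}' || isVerticalWS(C);
  });
  return {static_cast<std::size_t>(It - Body.begin()), It == Body.end(),
          It == Body.begin()};
}
```

Any character other than '}' or a line break is accepted into the candidate name, so something like "a+b" is scanned as a whole name and then rejected by the name lookup with a "not a valid Unicode character name" diagnostic, instead of being cut short at the '+'.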


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861

Files:
  clang/docs/ReleaseNotes.rst
  clang/lib/Lex/Lexer.cpp
  clang/lib/Lex/LiteralSupport.cpp
  clang/test/CXX/drs/dr26xx.cpp
  clang/test/Lexer/char-escapes-delimited.c
  clang/test/Lexer/unicode.c
  clang/test/Preprocessor/ucn-pp-identifier.c
  clang/www/cxx_dr_status.html

Index: clang/www/cxx_dr_status.html
===================================================================
--- clang/www/cxx_dr_status.html
+++ clang/www/cxx_dr_status.html
@@ -15647,7 +15647,7 @@
     <td><a href="https://wg21.link/cwg2640">2640</a></td>
     <td>accepted</td>
     <td>Allow more characters in an n-char sequence</td>
-    <td class="none" align="center">Unknown</td>
+    <td class="unreleased" align="center">Clang 16</td>
   </tr>
   <tr id="2641">
     <td><a href="https://wg21.link/cwg2641">2641</a></td>
Index: clang/test/Preprocessor/ucn-pp-identifier.c
===================================================================
--- clang/test/Preprocessor/ucn-pp-identifier.c
+++ clang/test/Preprocessor/ucn-pp-identifier.c
@@ -1,6 +1,6 @@
-// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify=expected,ext -Wundef
-// RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify=expected,ext -Wundef
-// RUN: %clang_cc1 %s -fsyntax-only -x c++ -std=c++2b -pedantic -ftrigraphs -verify=expected,cxx2b -Wundef -Wpre-c++2b-compat
+// RUN: %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -verify=expected,ext -Wundef -DTRIGRAPHS=1
+// RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify=expected,ext -Wundef -fno-trigraphs
+// RUN: %clang_cc1 %s -fsyntax-only -x c++ -std=c++2b -pedantic -ftrigraphs -DTRIGRAPHS=1 -verify=expected,cxx2b -Wundef -Wpre-c++2b-compat
 // RUN: %clang_cc1 %s -fsyntax-only -x c++ -pedantic -verify=expected,ext -Wundef -ftrigraphs -DTRIGRAPHS=1
 // RUN: not %clang_cc1 %s -fsyntax-only -std=c99 -pedantic -Wundef 2>&1 | FileCheck -strict-whitespace %s
 
@@ -40,7 +40,6 @@
                    // ext-warning {{extension}} cxx2b-warning {{before C++2b}}
 #define \N{WASTEBASKET} // expected-error {{macro name must be an identifier}} \
                         // ext-warning {{extension}} cxx2b-warning {{before C++2b}}
-
 #define a\u0024
 
 #if \u0110 // expected-warning {{is not defined, evaluates to 0}}
@@ -121,20 +120,39 @@
 #define \u{123456789}  // expected-error {{hex escape sequence out of range}} expected-error {{macro name must be an identifier}}
 #define \u{            // expected-warning {{incomplete delimited universal character name; treating as '\' 'u' '{' identifier}} expected-error {{macro name must be an identifier}}
 #define \u{fgh}        // expected-warning {{incomplete delimited universal character name; treating as '\' 'u' '{' identifier}} expected-error {{macro name must be an identifier}}
-#define \N{            // expected-warning {{incomplete delimited universal character name; treating as '\' 'N' '{' identifier}} expected-error {{macro name must be an identifier}}
+#define \N{
+// expected-warning@-1 {{incomplete delimited universal character name; treating as '\' 'N' '{' identifier}}
+// expected-error@-2 {{macro name must be an identifier}}
 #define \N{}           // expected-warning {{empty delimited universal character name; treating as '\' 'N' '{' '}'}} expected-error {{macro name must be an identifier}}
 #define \N{NOTATHING}  // expected-error {{'NOTATHING' is not a valid Unicode character name}} \
                        // expected-error {{macro name must be an identifier}}
 #define \NN            // expected-warning {{incomplete universal character name; treating as '\' followed by identifier}} expected-error {{macro name must be an identifier}}
 #define \N{GREEK_SMALL-LETTERALPHA}  // expected-error {{'GREEK_SMALL-LETTERALPHA' is not a valid Unicode character name}} \
                                      // expected-note {{characters names in Unicode escape sequences are sensitive to case and whitespaces}}
+#define \N{🤡}  // expected-error {{'🤡' is not a valid Unicode character name}} \
+                // expected-error {{macro name must be an identifier}}
 
 #define CONCAT(A, B) A##B
-int CONCAT(\N{GREEK, CAPITALLETTERALPHA}); // expected-error{{expected}} \
-                                           // expected-warning {{incomplete delimited universal character name}}
+int CONCAT(\N{GREEK
+, CAPITALLETTERALPHA});
+// expected-error@-2 {{expected}} \
+// expected-warning@-2 {{incomplete delimited universal character name}}
+
+int \N{\
+LATIN CAPITAL LETTER A WITH GRAVE};
+//ext-warning@-2 {{extension}} cxx2b-warning@-2 {{before C++2b}}
 
 #ifdef TRIGRAPHS
-int \N??<GREEK CAPITAL LETTER ALPHA??> = 0; // expected-warning{{extension}} cxx2b-warning {{before C++2b}} \
+int \N??<GREEK CAPITAL LETTER ALPHA??> = 0; // cxx2b-warning {{before C++2b}} \
+                                            //ext-warning {{extension}}\
                                             // expected-warning 2{{trigraph converted}}
 
+int a\N{LATIN CAPITAL LETTER A WITH GRAVE??>; // expected-warning {{trigraph converted}}
+#endif
+
+#ifndef TRIGRAPHS
+int a\N{LATIN CAPITAL LETTER A WITH GRAVE??>;
+// expected-warning@-1 {{trigraph ignored}}\
+// expected-warning@-1 {{incomplete}}\
+// expected-error@-1 {{expected ';' after top level declarator}}
 #endif
Index: clang/test/Lexer/unicode.c
===================================================================
--- clang/test/Lexer/unicode.c
+++ clang/test/Lexer/unicode.c
@@ -43,6 +43,7 @@
 extern int  \U0001E4D0; // 𞓐 NAG MUNDARI LETTER O - Added in Unicode 15
 extern int _\N{TANGSA LETTER GA};
 extern int _\N{TANGSALETTERGA}; // expected-error {{'TANGSALETTERGA' is not a valid Unicode character name}} \
+                                // expected-error {{expected ';' after top level declarator}} \
                                 // expected-note {{characters names in Unicode escape sequences are sensitive to case and whitespace}}
 
 
Index: clang/test/Lexer/char-escapes-delimited.c
===================================================================
--- clang/test/Lexer/char-escapes-delimited.c
+++ clang/test/Lexer/char-escapes-delimited.c
@@ -96,6 +96,11 @@
   unsigned i = u'\N{GREEK CAPITAL LETTER DELTA}'; // ext-warning {{extension}} cxx2b-warning {{C++2b}}
   char j = '\NN';                                 // expected-error {{expected '{' after '\N' escape sequence}} expected-warning {{multi-character character constant}}
   unsigned k = u'\N{LOTUS';                       // expected-error {{incomplete universal character name}}
+
+  const char* emoji = "\N{🤡}"; // expected-error {{'🤡' is not a valid Unicode character name}} \
+                                // expected-note 5{{did you mean}}
+  const char* nested = "\N{\N{SPARKLE}}"; // expected-error {{'\N{SPARKLE' is not a valid Unicode character name}} \
+                                          // expected-note 5{{did you mean}}
 }
 
 void separators(void) {
Index: clang/test/CXX/drs/dr26xx.cpp
===================================================================
--- clang/test/CXX/drs/dr26xx.cpp
+++ clang/test/CXX/drs/dr26xx.cpp
@@ -104,3 +104,18 @@
     return k();
   }
 }
+
+namespace dr2640 { // dr2640: 16
+
+int \N{Î} = 0; //expected-error {{'Î' is not a valid Unicode character name}} \
+               //expected-error {{expected unqualified-id}}
+const char* emoji = "\N{🤡}"; // expected-error {{'🤡' is not a valid Unicode character name}} \
+                              // expected-note 5{{did you mean}}
+
+#define z(x) 0
+#define a z(
+int x = a\N{abc}); // expected-error {{'abc' is not a valid Unicode character name}}
+int y = a\N{LOTUS}); // expected-error {{character <U+1FAB7> not allowed in an identifier}} \
+                     // expected-error {{use of undeclared identifier 'a🪷'}} \
+                     // expected-error {{extraneous ')' before ';'}}
+}
Index: clang/lib/Lex/LiteralSupport.cpp
===================================================================
--- clang/lib/Lex/LiteralSupport.cpp
+++ clang/lib/Lex/LiteralSupport.cpp
@@ -548,11 +548,10 @@
     return false;
   }
   ThisTokBuf++;
-  const char *ClosingBrace =
-      std::find_if_not(ThisTokBuf, ThisTokEnd, [](char C) {
-        return llvm::isAlnum(C) || llvm::isSpace(C) || C == '_' || C == '-';
-      });
-  bool Incomplete = ClosingBrace == ThisTokEnd || *ClosingBrace != '}';
+  const char *ClosingBrace = std::find_if(ThisTokBuf, ThisTokEnd, [](char C) {
+    return C == '}' || isVerticalWhitespace(C);
+  });
+  bool Incomplete = ClosingBrace == ThisTokEnd;
   bool Empty = ClosingBrace == ThisTokBuf;
   if (Incomplete || Empty) {
     if (Diags) {
Index: clang/lib/Lex/Lexer.cpp
===================================================================
--- clang/lib/Lex/Lexer.cpp
+++ clang/lib/Lex/Lexer.cpp
@@ -1195,15 +1195,15 @@
 /// whether trigraphs are enabled or not.
 static char DecodeTrigraphChar(const char *CP, Lexer *L, bool Trigraphs) {
   char Res = GetTrigraphCharForLetter(*CP);
-  if (!Res || !L) return Res;
+  if (!Res) return Res;
 
   if (!Trigraphs) {
-    if (!L->isLexingRawMode())
+    if (L && !L->isLexingRawMode())
       L->Diag(CP-2, diag::trigraph_ignored);
     return 0;
   }
 
-  if (!L->isLexingRawMode())
+  if (L && !L->isLexingRawMode())
     L->Diag(CP-2, diag::trigraph_converted) << StringRef(&Res, 1);
   return Res;
 }
@@ -3295,7 +3295,10 @@
 
   if (Result) {
     Result->setFlag(Token::HasUCN);
-    if (CurPtr - StartPtr == (ptrdiff_t)(Count + 2 + (Delimited ? 2 : 0)))
+    // If the UCN contains either a trigraph or a line splicing,
+    // we need to call getAndAdvanceChar again to set the appropriate flags
+    // on Result.
+    if (CurPtr - StartPtr == (ptrdiff_t)(Count + 1 + (Delimited ? 2 : 0)))
       StartPtr = CurPtr;
     else
       while (StartPtr != CurPtr)
@@ -3335,7 +3338,7 @@
       break;
     }
 
-    if (!isAlphanumeric(C) && C != '_' && C != '-' && C != ' ')
+    if (isVerticalWhitespace(C))
       break;
     Buffer.push_back(C);
   }
@@ -3349,14 +3352,14 @@
   }
 
   StringRef Name(Buffer.data(), Buffer.size());
-  llvm::Optional<char32_t> Res =
+  llvm::Optional<char32_t> Match =
       llvm::sys::unicode::nameToCodepointStrict(Name);
   llvm::Optional<llvm::sys::unicode::LooseMatchingResult> LooseMatch;
-  if (!Res) {
-    if (!isLexingRawMode()) {
+  if (!Match) {
+    LooseMatch = llvm::sys::unicode::nameToCodepointLooseMatching(Name);
+    if (Diagnose) {
       Diag(StartPtr, diag::err_invalid_ucn_name)
           << StringRef(Buffer.data(), Buffer.size());
-      LooseMatch = llvm::sys::unicode::nameToCodepointLooseMatching(Name);
       if (LooseMatch) {
         Diag(StartName, diag::note_invalid_ucn_name_loose_matching)
             << FixItHint::CreateReplacement(
@@ -3364,35 +3367,37 @@
                    LooseMatch->Name);
       }
     }
-    // When finding a match using Unicode loose matching rules
-    // recover after having emitted a diagnostic.
-    if (!LooseMatch)
-      return std::nullopt;
     // We do not offer misspelled character names suggestions here
     // as the set of what would be a valid suggestion depends on context,
     // and we should not make invalid suggestions.
   }
 
-  if (Diagnose && PP && !LooseMatch)
+  if (Diagnose && Match)
     Diag(BufferPtr, PP->getLangOpts().CPlusPlus2b
                         ? diag::warn_cxx2b_delimited_escape_sequence
                         : diag::ext_delimited_escape_sequence)
         << /*named*/ 1 << (PP->getLangOpts().CPlusPlus ? 1 : 0);
 
-  if (LooseMatch)
-    Res = LooseMatch->CodePoint;
+  // If no diagnostic has been emitted yet, likely because we are doing a tentative lexing,
+  // we do not want to recover here to make sure the token will not be incorrectly considered valid.
+  // This function will be called again and a diagnostic emitted then.
+  if (LooseMatch && Diagnose)
+    Match = LooseMatch->CodePoint;
 
   if (Result) {
     Result->setFlag(Token::HasUCN);
-    if (CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
+    // If the UCN contains either a trigraph or a line splicing,
+    // we need to call getAndAdvanceChar again to set the appropriate flags
+    // on Result.
+    if (CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 3))
       StartPtr = CurPtr;
     else
       while (StartPtr != CurPtr)
         (void)getAndAdvanceChar(StartPtr, *Result);
   } else {
-    StartPtr = CurPtr;
+      StartPtr = CurPtr;
   }
-  return *Res;
+  return Match ? llvm::Optional<uint32_t>(*Match) : llvm::None;
 }
 
 uint32_t Lexer::tryReadUCN(const char *&StartPtr, const char *SlashLoc,
Index: clang/docs/ReleaseNotes.rst
===================================================================
--- clang/docs/ReleaseNotes.rst
+++ clang/docs/ReleaseNotes.rst
@@ -707,6 +707,7 @@
 - Implemented "char8_t Compatibility and Portability Fix" (`P2513R3 <https://wg21.link/P2513R3>`_).
   This change was applied to C++20 as a Defect Report.
 - Implemented "Permitting static constexpr variables in constexpr functions" (`P2647R1 <https://wg21.link/P2647R1>_`).
+- Implemented `CWG2640 Allow more characters in an n-char sequence <https://wg21.link/CWG2640>`_.
 
 CUDA/HIP Language Changes in Clang
 ----------------------------------

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
