[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Corentin Jabot via Phabricator via cfe-commits Fri, 09 Dec 2022 11:57:07 -0800

cor3ntin added inline comments.


================
Comment at: clang/lib/Lex/Lexer.cpp:3378-3379
 
-  if (LooseMatch)
+  // If no diagnostic has been emitted yet we do not want to recover here
+  // to make sure this function will be called again and a diagnostic emitted 
then.
+  if (LooseMatch && Diagnose)
----------------
tahonermann wrote:
> cor3ntin wrote:
> > tahonermann wrote:
> > > I'm having a hard time understanding what this comment is trying to 
> > > convey. The comment response to Aaron was much more helpful to me. How 
> > > about this suggested edit?
> > It's not that a diagnostic would be emitted twice, is that it would not be 
> > emitted at all.
> > By recovering - ie returning a loose match - we prevent the function to be 
> > called again on the same buffer.
> > So if during the first call Diagnose is false (because we were called from 
> > a tryXXXX function), then the compilation will continue assuming we were in 
> > fact able to parse an identifier and never informs us of an invalid name.
> More evidence of the difficulty I'm having appreciating this comment :)
> 
> Does this sound right?
>   // Only map a loose match to a code point when in a diagnosing state.
>   // If a mapped code point were to be returned in a non-diagnosing state,
>   // token caching would prevent the opportunity to issue a diagnostic when
>   // the token is later used.
> 
> Tangent: It might be worth renaming one of `Res` and `Result`. I keep getting 
> confused due to the similar names. Perhaps rename `Result` to `Token` or 
> `ResultToken`?
I'm not sure caching is exactly right.
There are tentative parses, that should not give false positive answers. I'll 
try to message a better comment


================
Comment at: clang/lib/Lex/Lexer.cpp:3386
+
+  if (!Result || CurPtr - StartPtr == (ptrdiff_t)(Buffer.size() + 4))
     StartPtr = CurPtr;
----------------
tahonermann wrote:
> cor3ntin wrote:
> > cor3ntin wrote:
> > > tahonermann wrote:
> > > > This is a bit of a tangent, but under what circumstances would `CurPtr 
> > > > - StartPtr` not be equal to `Buffer.size() + 4`? Actually, I'm not sure 
> > > > that +4 is correct. It looks like `StartPtr` is expected to point to 
> > > > `N` at the beginning of the function, so the initial `\` is not 
> > > > included in the range. If `N` isn't seen, the function returns early. 
> > > > Likewise, if either of the `{` or `}` delimiters is not found, the 
> > > > function returns early.
> > > > 
> > > > I think the goal of this code is to skip the `getAndAdvanceChar()` 
> > > > machinery when the buffer is being claimed (no need to parse UCNs or 
> > > > trigraphs in the character name), but it looks like, particularly with 
> > > > this DR, that we might always be able to do that.
> > > afaict, 4 is correct here because we are one past-the-end.
> > > I do however agree that this whole pattern which is used a few times is 
> > > probably unnecessary, i do think it would be good to investigate. Not in 
> > > this patch though, imo
> > I looked into it, I'm sure we could improve but not easily, 
> > `getAndAdvanceChar` does set some flags on the token in the presence of 
> > trigraphs/line splicing and we need those flags to be set, this is the 
> > reason we do need to call that method.
> > It's not super efficient but it's such an edge case... I'd rather not touch 
> > that now
> My concern is that, as is, the code always takes the `else` branch (except 
> when `Result` is non-null). Assuming a buffer containing "X", the pointers 
> are arranged as follows (where `$` is one past the end).
>   \ N { X } $
>     |   |   `- CurPtr
>     |   `- Buffer
>     `- StartPtr
> `CurPtr - StartPtr` is 4, but `Buffer.size() + 4` is 5 (`Buffer.size()` is 1 
> in this case).
> 
> I think there might be an easy test to see if this is working as intended. If 
> it isn't, I would expect a diagnostic to be issued if trigraphs are enabled 
> and the buffer contains a trigraph sequence. Something like:
>   \N{LOTUS??>}
I can try to add tests

> My concern is that, as is, the code always takes the else branch (except when 
> Result is non-null). 
Yes, the if branch sole purpose is to set some state in Result.

At the start of the function, StartPtr points to `\`
And I'll leave a comment, maybe that will clear up future confusions

There may be a potential refactor here, which is to have `getAndAdvanceChar` 
take a `bool & ShouldCleanupToken` parameter instead of a token so that we 
don't have to do that dance, but it's a bit outside of the scope of this 
patch...


================
Comment at: clang/test/Preprocessor/ucn-pp-identifier.c:141
+// expected-warning@-2 {{incomplete delimited universal character name}}
 
 #ifdef TRIGRAPHS
----------------
tahonermann wrote:
> How about adding a test for an escaped new line? I think this is currently UB 
> according to the standard ([[ http://eel.is/c++draft/lex.phases#1.2 | 
> (lex.phases)p2 ]]), but I believe you are trying to change that with [[ 
> https://wg21.link/p2621 | P2621 ]].
>   int \N{\
>   LATIN CAPITAL LETTER A WITH GRAVE};
I can add a test, sure
P2621 is not super relevant (or rather, it standardize behaviors that clang 
already has)


Repository:
  rG LLVM Github Monorepo

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D138861/new/

https://reviews.llvm.org/D138861

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits

[PATCH] D138861: [Clang] Implement CWG2640 Allow more characters in an n-char sequence

Reply via email to