On Wed, Feb 15, 2023 at 1:39 PM Jason Merrill <ja...@redhat.com> wrote: > > On 9/26/22 15:27, Lewis Hyatt wrote: > > On Wed, Jun 15, 2022 at 03:06:16PM -0400, Lewis Hyatt wrote: > >> On Tue, Jun 14, 2022 at 05:26:49PM -0400, Lewis Hyatt wrote: > >>> Hello- > >>> > >>> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103902 > >>> > >>> The attached patch resolves PR preprocessor/103902 as described in the > >>> patch > >>> message inline below. bootstrap + regtest all languages was successful on > >>> x86-64 Linux, with no new failures: > >>> > >>> FAIL 103 103 > >>> PASS 542338 542371 > >>> UNSUPPORTED 15247 15250 > >>> UNTESTED 136 136 > >>> XFAIL 4166 4166 > >>> XPASS 17 17 > >>> > >>> Please let me know if it looks OK? > >>> > >>> A few questions I have: > >>> > >>> - A difference introduced with this patch is that after lexing something > >>> like `operator ""_abc', then `_abc' is added to the identifier hash map, > >>> whereas previously it was not. I feel like this must be OK because with > >>> the > >>> optional space as in `operator "" _abc', it would be added with or > >>> without the > >>> patch. > >>> > >>> - The behavior of `#pragma GCC poison' is not consistent (including prior > >>> to > >>> my patch). I tried to make it more so but there is still one thing I > >>> want to > >>> ask about. Leaving aside extended characters for now, the > >>> inconsistency is > >>> that currently the poison is only checked, when the suffix appears as a > >>> standalone token. > >>> > >>> #pragma GCC poison _X > >>> bool operator ""_X (unsigned long long); //accepted before the patch, > >>> //rejected after it > >>> bool operator "" _X (unsigned long long); //rejected either before or > >>> after > >>> const char * operator ""_X (const char *, unsigned long); //accepted > >>> before, > >>> //rejected > >>> after > >>> const char * operator "" _X (const char *, unsigned long); //rejected > >>> either > >>> > >>> const char * s = ""_X; //accepted before the patch, rejected after it > >>> const bool b = 1_X; //accepted before or after **** > >>> > >>> I feel like after the patch, the behavior is the expected behavior for all > >>> cases but the last one. Here, we allow the poisoned identifier because > >>> it's > >>> not lexed as an identifier, it's lexed as part of a pp-number. Does it > >>> seem OK > >>> like this or does it need to be addressed? > >> > >> Sorry, that version actually did not handle the case of -Wc++11-compat in > >> c++98 mode correctly. This updated version fixes that and adds the missing > >> test coverage for that, if you could please review this one instead? > >> > >> By the way, the pipermail archive seems to permanently mangle UTF-8 in > >> inline > >> attachments. I attached the patch also gzipped to address that for the > >> archive, since the new testcases do use non-ASCII characters. > >> > >> Thanks for taking a look! > > > > Hello- > > > > May I please ping this patch again? Joseph suggested that it would be best > > if > > a C++ maintainer has a look at it. This is one of just a few places left > > where > > we don't handle UTF-8 properly in libcpp, it would be really nice to get > > them > > fixed up if there is time to review this patch. Thanks! > > > > https://gcc.gnu.org/pipermail/gcc-patches/2022-June/596704.html > > > > I re-attached it here as it required some trivial rebasing on top of > > recently > > pushed changes. As before, I also attached the gzipped version so that the > > UTF-8 testcases show up OK in the online archive, in case that's still an > > issue. Thanks for taking a look! > > Thank you for the patch, sorry it slipped off my radar. >
Thanks for taking a look at it. It's certainly an edge case that is not bothering anyone too much, so no rush with it. > > This patch fixes it by adding a new function scan_cur_identifier() that can > > be > > used to lex an identifier while in the middle of lexing another token. It is > > somewhat duplicative of the code in lex_identifier(), which handles the > > normal > > case, but I think there's no good way to avoid that without pessimizing the > > usual case, since lex_identifier() takes advantage of the fact that the > > first > > character of the identifier has already been analyzed. > > So could you analyze the first character and then call lex_identifier? > Yes, it can be done this way. lex_identifier may need some adaptations though, since it does some other work like tracking the original spelling of the identifier. Plus per your comments below, it would need to avoid the poison and other checks too. I think it's pretty straightforward to refactor a bit so that it works out. I kinda thought it may not be desirable to touch lex_identifier, which is called on everything, just to handle this rare case, however I am happy to do it this way after confirming it won't hurt performance. > > With scan_cur_identifier(), we do also correctly warn about bidi and > > normalization issues in the extended identifiers comprising the suffix, and > > we > > check for poisoned identifiers there as well. > > Hmm, I don't think we want the check for poisoned identifiers; a suffix > is not a name. That goes for the other diagnostics in > identifier_diagnostics_on_lex, as well. At the meeting last week the > committee decided to deprecate the declaration with a space to clarify > this distinction. > OK thanks, interesting. > > + if (!accum.accum) > > + create_literal2 (pfile, token, base, > > + suffix_begin - base, > > + NODE_NAME (sr.node), > > + NODE_LEN (sr.node), > > + type); > > + else > > + { > > + accum.create_literal2 (pfile, token, base, > > + suffix_begin - base, > > + NODE_NAME (sr.node), > > + NODE_LEN (sr.node), > > + type); > > + _cpp_release_buff (pfile, accum.first); > > + } > > How about always calling accum.create_literal2? Yes it should be equivalent and better that way. Sounds like I should try a version that attempts to reuse the logic in lex_identifier and cleans up your other points, so I'll look into that next. Thanks! -Lewis