https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110343
Jakub Jelinek <jakub at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
CC| |redi at gcc dot gnu.org
--- Comment #9 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
I've tried to understand the preprocessor issue mentioned in the paper, but am
confused about what the right behavior is and why.
Consider
#define STR(x) #x
const char *a = "\u00b7";
const char *b = STR(\u00b7);
const char *c = "\u0041";
const char *d = STR(\u0041);
const char *e = STR(a\u00b7);
const char *f = STR(a\u0041);
const char *g = STR(a \u00b7);
const char *h = STR(a \u0041);
const char *i = "\u066d";
const char *j = STR(\u066d);
const char *k = "\u0040";
const char *l = STR(\u0040);
const char *m = STR(a\u066d);
const char *n = STR(a\u0040);
const char *o = STR(a \u066d);
const char *p = STR(a \u0040);
Neither clang nor gcc emits any diagnostics on the a, c, i and k initializers,
those are certainly valid.
With -pedantic-errors, g++ emits errors on all the others, while clang++ does
so only on the STR ones involving \u0041, \u0040 and a\u066d.
The chosen values are \u0040 '@' as something being changed by this paper,
\u0041 'A',
\u00b7 as an example of a character which is pedantically valid in identifiers
if not at the start, and \u066d as something pedantically not valid in
identifiers.
Now, https://eel.is/c++draft/lex.charset#6 says that a UCN used outside of a
string/character literal which corresponds to a basic character set character
(or a control character) is ill-formed; that would make the d, f, h cases
invalid for C++ and the l, n, p cases invalid for C++26.
https://eel.is/c++draft/lex.name states which characters can appear at the
start of the identifier and which can appear after the start.
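
To make the classification concrete, here is a minimal sketch hard-coded for
just the four code points used in the testcase. IdClass, classify and
ucn_ill_formed_outside_literal are made-up names for illustration (nothing
from libcpp or clang), and the table only encodes what is said above about
these characters; it is not a real XID_Start/XID_Continue lookup.

```cpp
// Hypothetical categories, only as fine-grained as the testcase needs.
enum class IdClass {
    Basic,            // member of the basic character set
    ContinueOnly,     // valid in an identifier, but not at its start
    NotInIdentifier,  // not valid anywhere in an identifier
    Unmodelled        // any code point this sketch does not cover
};

// Assumption: cxx26 == true models the paper's change that adds '@'
// to the basic character set; false models C++23.
constexpr IdClass classify(char32_t c, bool cxx26) {
    switch (c) {
    case 0x0041: return IdClass::Basic;                  // 'A'
    case 0x0040: return cxx26 ? IdClass::Basic           // '@'
                              : IdClass::NotInIdentifier;
    case 0x00B7: return IdClass::ContinueOnly;           // U+00B7
    case 0x066D: return IdClass::NotInIdentifier;        // U+066D
    default:     return IdClass::Unmodelled;
    }
}

// lex.charset#6 as read above: a UCN outside a string/character literal
// that names a basic character set character is ill-formed.
constexpr bool ucn_ill_formed_outside_literal(char32_t c, bool cxx26) {
    return classify(c, cxx26) == IdClass::Basic;
}
```

Under these assumptions the d, f, h cases (\u0041) are ill-formed in both
modes, while the l, n, p cases (\u0040) only become ill-formed with
cxx26 == true.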
And https://eel.is/c++draft/lex.pptoken states that preprocessing-token is
either identifier, or tons of other things, or
"each non-whitespace character that cannot be one of the above"
Then https://eel.is/c++draft/lex.pptoken#1 says that this last category is
invalid if the preprocessing token is being converted into a token.
And https://eel.is/c++draft/lex.pptoken#2 includes
"If any character not in the basic character set matches the last category, the
program is ill-formed."
Now, e.g. for the C++23 STR(\u0040) case, \u0040 is not in the basic character
set there, so it is valid outside of literals (no longer the case in C++26),
but it isn't a nondigit and doesn't have the XID_Start property, so IMHO it
isn't an identifier and so must be the "each non-whitespace character that
cannot be one of the above" case.
Why doesn't the above mentioned https://eel.is/c++draft/lex.pptoken#2 sentence
make that invalid? Ignoring that, I'd say it would be then stringized and that
feels like it is what clang++ is doing.
Now, e.g. for the STR(a\u066d) case, I wonder why that isn't lexed as the
identifier a followed by a \u066d "each non-whitespace character that cannot
be one of the above" token and stringized similarly; clang++ rejects that.
What GCC libcpp seems to be doing is that forms_identifier_p calls
_cpp_valid_utf8 or _cpp_valid_ucn with an argument which tells whether the
character is first or second+ in the identifier, and e.g. _cpp_valid_ucn then,
for UCNs valid in string literals, does
  else if (identifier_pos)
    {
      int validity = ucn_valid_in_identifier (pfile, result, nst);
      if (validity == 0)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid in an identifier",
                   (int) (str - base), base);
      else if (validity == 2 && identifier_pos == 1)
        cpp_error (pfile, CPP_DL_ERROR,
                   "universal character %.*s is not valid at the start of an identifier",
                   (int) (str - base), base);
    }
so basically all the cases invalid in identifiers emit an error and are then
treated as if they were valid in identifiers. That differs from what e.g.
_cpp_valid_utf8 does, for C but not for C++, and only for the chars completely
invalid in identifiers rather than those valid in identifiers but not at the
start:
      /* In C++, this is an error for invalid character in an identifier
         because logically, the UTF-8 was converted to a UCN during
         translation phase 1 (even though we don't physically do it that
         way).  In C, this byte rather becomes grammatically a separate
         token.  */
      if (CPP_OPTION (pfile, cplusplus))
        cpp_error (pfile, CPP_DL_ERROR,
                   "extended character %.*s is not valid in an identifier",
                   (int) (*pstr - base), base);
      else
        {
          *pstr = base;
          return false;
        }
The comment doesn't really match what is done in recent C++ versions because
there UCNs are translated to characters and not the other way around.
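
To illustrate the difference between the two behaviors quoted above, here is
a toy model (not libcpp code). Strategy, LexResult, extended_valid and
lex_identifier are hypothetical names, and the only extended character treated
as valid in an identifier is U+00B7 from the testcase. DiagnoseAndContinue
mimics the "emit an error and keep it in the identifier" path, SplitToken the
"rewind and let it become a separate token" path.

```cpp
#include <cstddef>

// ASCII characters that may appear in an identifier (start position is not
// distinguished in this sketch).
constexpr bool is_ascii_ident(char32_t c) {
    return (c >= U'a' && c <= U'z') || (c >= U'A' && c <= U'Z')
           || (c >= U'0' && c <= U'9') || c == U'_';
}

// Assumption: only U+00B7 is modelled as a valid extended identifier char.
constexpr bool extended_valid(char32_t c) { return c == 0x00B7; }

enum class Strategy { DiagnoseAndContinue, SplitToken };

struct LexResult {
    std::size_t identifier_length;  // code points taken into the identifier
    bool error;                     // whether a diagnostic would be emitted
};

constexpr LexResult lex_identifier(const char32_t *s, std::size_t n,
                                   Strategy strategy) {
    LexResult r{0, false};
    for (std::size_t i = 0; i < n; ++i) {
        char32_t c = s[i];
        if (is_ascii_ident(c) || (c >= 0x80 && extended_valid(c))) {
            ++r.identifier_length;
        } else if (c < 0x80) {
            break;  // ordinary punctuation or whitespace ends the identifier
        } else if (strategy == Strategy::DiagnoseAndContinue) {
            r.error = true;         // complain ...
            ++r.identifier_length;  // ... but still treat it as part of it
        } else {
            break;  // leave the character for a separate token
        }
    }
    return r;
}
```

In this model, for the code points of a\u066d, DiagnoseAndContinue gives a
two-character identifier plus an error, while SplitToken gives the identifier
a with no error and leaves U+066D to be lexed on its own.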