------- Additional Comments From neil at daikokuya dot co dot uk 2005-02-21 23:00 -------
Subject: Re: UCNs not recognized in identifiers (c++/c99)
jsm28 at gcc dot gnu dot org wrote:-
> * The greedy algorithm applies for lexing UCNs: for example,
> a\U0000000z is three preprocessing tokens {a}{\}{U0000000z} (and
> shouldn't get a diagnostic on lexing, presuming macros are defined
> such that the eventual token sequence is valid).
I'm not sure I agree with this: it seems like unnecessary extra
work, and I suspect the user would benefit more from being told
he entered an ill-formed UCN than from a seemingly random complaint
from the front end about an unexpected backslash.
The only case where you wouldn't get a syntax error from the
front end, or an invalid escape in a literal, is with -E. I'm
not sure lexing to the letter of the standard is worthwhile in
this case, as the standard doesn't discuss -E.
If you have an example of a valid compiled program that relies on
the ill-formed UCN lexing as multiple tokens, then I would agree
with you.
> * The spelling of UCNs is preserved for the # and ## operators.
This is very hard with CPP's current implementation - it assumes
it can deduce the spelling of an identifier from its hash table
entry. IMO the proper way to fix this is to use a different approach
entirely, rather than kludge it into the existing implementation
(which would bloat some common data structures), but that's some work.
> * I think the only reasonable interpretation of the lexing rules in
> the context of forbidden characters is that first identifiers are
> lexed (allowing any UCNs) then bad characters yield an error (rather
> than stopping the identifier before the bad character and treating it
> as not a UCN).
Agreed - as I say above I don't see why this shouldn't apply for
partial UCNs too, even with -E.
The rest seems reasonable.
Neil.
--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=9449