[Bug c/67224] UTF-8 support for identifier names in GCC

lhyatt at gmail dot com Mon, 22 Jul 2019 16:17:00 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224


Lewis Hyatt <lhyatt at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lhyatt at gmail dot com

--- Comment #26 from Lewis Hyatt <lhyatt at gmail dot com> ---
Created attachment 46618
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46618&action=edit
Patch with test cases that implements extended chars in identifiers

Hi All-

I am interested in helping out with this if there is still interest to support
this feature. (Full disclosure, I don't have any experience with the gcc
codebase, but I do have a lot of experience developing with gcc.)

I took a crack at implementing it based on Joseph's outline in Comment #21 and
the rest of the discussion in this thread. The patch is attached, including
test cases. (I more or less took all the existing ucnid* test cases and adapted
them for this, plus added a couple extra ones.) It seems to work fine, as far
as interpreting the identifiers and bootstrapping clean, and test cases also
pass except for one that I'll mention below, but I have many comments +
questions as well:

1. The number of changes to libcpp is actually pretty small. All the work to
recognize UTF-8 happens in forms_identifier_p(), so that the existing fast
paths for regular characters are not affected, and that extended chars end up
getting treated just like UCNs for the most part. forms_identifier_p() makes
use of a new utility _cpp_valid_utf8_in_identifier() in charset.c that is
similar to the existing _cpp_valid_ucn() and handles the UTF8 details.

2. otherwise _cpp_interpret_identifier() and lex_identifier() didn't need any
changes. The former could be optimized a bit, it always allocates a temporary
buffer, even though the buffer is only needed if UCNs appear. (This is the case
already in the case of dollar signs that end up in this code path too.) 
Probably it's not a big deal though.

3. Invalid UTF-8 is left alone and parsed as stray tokens, the same as now.

4. Regarding the case of codepoints not allowed to appear in an identifier. In
C, I did as Joseph suggested, or at least I tried to. The grammar specifies
that an identifier ends once an illegal character is encountered, so this is
how it works now, and then the disallowed UTF-8 forms a stray token next. It
was not clear to me though whether this stray token should consist of just the
next 1 byte of the input, or the entire disallowed UTF-8 character. Currently
it's just the next byte because that's how things worked out of the box.
Changing it wouldn't be too hard, just means the default case of
_cpp_lex_direct()'s main switch statement would need to try to read a UTF char
rather than a byte.

  In C++, I think UCNs or UTF-8 in identifiers should be treated identically in
all respects, unless I misunderstand things (because technically the UTF-8 was
supposed to be converted to UCNs in translation phase 1), so in that case a
disallowed codepoint does not end the token but rather triggers an invalid
character error.

5. There is a problem with diagnostics output when the source contains UTF-8
characters. The locator caret ends up in the wrong place, I assume just because
this code is not aware of the multibyte encoding. That much is not specific to
my patch, it exists already now e.g. with:

$ cat t.cpp
const char* x = "ππππππππππππππ"; int y = z;

$ g++ -c t.cpp
t.cpp:1:57: error: ‘z’ was not declared in this scope
 const char* x = "ππππππππππππππ"; int y = z;
                                                         ^

The bigger problem though is in layout::print_source_line() which colorizes the
source lines. It seems to end up attempting to colorize just the first byte,
even for UTF-8, which makes the output no longer valid. I tried to look into it
but I wasn't sure what are the implications, e.g. would it require some much
larger overhaul of diagnostics infrastructure anyway to get this right, and
would it perhaps be better just to disable colorization in the presence of
UTF-8 input or something like this, for the meantime.

As an example of what I mean, from preprocessing this (in c99 mode):

--------
int ٩;
--------

I get:
t3.c:1:5: error: universal character ٩ is not valid at the start of an
identifier
    1 | int ٩;

But if color is enabled, the output gets corrupted because ANSI escapes are
inserted between the two bytes of the multibyte character on the 2nd line of
diagnostics.

6. There is also a problem with formatting the output of some diagnostics, e.g.
when I compile this:

---------------
int π = 3;
int x = π2;
---------------

I get:
t2.cpp:2:9: error: ‘π2’ was not declared in this scope; did you mean
‘\xcf\x80’? This is also not specific to this patch and occurs the same if UCN
is used:

$ cat t.cpp
int \u03C0 = 3;
int x = \u03C02;

$ g++ -c t.cpp
t.cpp:2:9: error: ‘π2’ was not declared in this scope
 int x = \u03C02;
         ^~~~~~~
t.cpp:2:9: note: suggested alternative: ‘\xcf\x80’
 int x = \u03C02;
         ^~~~~~~

7. What is the expected output from gcc -E of this code?

-------
int π;
--------

Currently it outputs:
int \U000003c0;

So curiously, it's as if C++ required translation of extended chars to UCNs is
happening, so I think this output is actually potentially correct in C++ mode?
But it is also this way in C mode which I think is probably not expected. It
seems to come from cpp_output_token() which does not make use of the "original
spelling" data structures. I am not sure about this one but probably the right
solution is not much work, if someone knows what that might be?

This is also the reason that one of the new testcases
(gcc/testsuite/gcc.dg/cpp/ucnid-13-utf8.c) fails, this:

#define Á 1

also preprocesses (in -E -dD) to include UCNs. I am not sure what is expected
here.

8. There are tests (e.g. gcc/testsuite/gcc.dg/ucnid-10.c) which verify that
when the locale is not utf8, diagnostics use UCNs instead of raw UTF8. I am not
sure if this still makes sense when the files themselves contain UTF8, but that
was the behavior that came out so I maintained these tests as well.

Thanks for taking a look at this, and for all your work on gcc. I am happy to
work on this if someone has any feedback on what I tried so far. 

-lewis

[Bug c/67224] UTF-8 support for identifier names in GCC

Reply via email to