https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224
Lewis Hyatt <lhyatt at gmail dot com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lhyatt at gmail dot com --- Comment #26 from Lewis Hyatt <lhyatt at gmail dot com> --- Created attachment 46618 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46618&action=edit Patch with test cases that implements extended chars in identifiers Hi All- I am interested in helping out with this if there is still interest to support this feature. (Full disclosure, I don't have any experience with the gcc codebase, but I do have a lot of experience developing with gcc.) I took a crack at implementing it based on Joseph's outline in Comment #21 and the rest of the discussion in this thread. The patch is attached, including test cases. (I more or less took all the existing ucnid* test cases and adapted them for this, plus added a couple extra ones.) It seems to work fine, as far as interpreting the identifiers and bootstrapping clean, and test cases also pass except for one that I'll mention below, but I have many comments + questions as well: 1. The number of changes to libcpp is actually pretty small. All the work to recognize UTF-8 happens in forms_identifier_p(), so that the existing fast paths for regular characters are not affected, and that extended chars end up getting treated just like UCNs for the most part. forms_identifier_p() makes use of a new utility _cpp_valid_utf8_in_identifier() in charset.c that is similar to the existing _cpp_valid_ucn() and handles the UTF8 details. 2. otherwise _cpp_interpret_identifier() and lex_identifier() didn't need any changes. The former could be optimized a bit, it always allocates a temporary buffer, even though the buffer is only needed if UCNs appear. (This is the case already in the case of dollar signs that end up in this code path too.) Probably it's not a big deal though. 3. Invalid UTF-8 is left alone and parsed as stray tokens, the same as now. 4. Regarding the case of codepoints not allowed to appear in an identifier. In C, I did as Joseph suggested, or at least I tried to. The grammar specifies that an identifier ends once an illegal character is encountered, so this is how it works now, and then the disallowed UTF-8 forms a stray token next. It was not clear to me though whether this stray token should consist of just the next 1 byte of the input, or the entire disallowed UTF-8 character. Currently it's just the next byte because that's how things worked out of the box. Changing it wouldn't be too hard, just means the default case of _cpp_lex_direct()'s main switch statement would need to try to read a UTF char rather than a byte. In C++, I think UCNs or UTF-8 in identifiers should be treated identically in all respects, unless I misunderstand things (because technically the UTF-8 was supposed to be converted to UCNs in translation phase 1), so in that case a disallowed codepoint does not end the token but rather triggers an invalid character error. 5. There is a problem with diagnostics output when the source contains UTF-8 characters. The locator caret ends up in the wrong place, I assume just because this code is not aware of the multibyte encoding. That much is not specific to my patch, it exists already now e.g. with: $ cat t.cpp const char* x = "ππππππππππππππ"; int y = z; $ g++ -c t.cpp t.cpp:1:57: error: ‘z’ was not declared in this scope const char* x = "ππππππππππππππ"; int y = z; ^ The bigger problem though is in layout::print_source_line() which colorizes the source lines. It seems to end up attempting to colorize just the first byte, even for UTF-8, which makes the output no longer valid. I tried to look into it but I wasn't sure what are the implications, e.g. would it require some much larger overhaul of diagnostics infrastructure anyway to get this right, and would it perhaps be better just to disable colorization in the presence of UTF-8 input or something like this, for the meantime. As an example of what I mean, from preprocessing this (in c99 mode): -------- int ٩; -------- I get: t3.c:1:5: error: universal character ٩ is not valid at the start of an identifier 1 | int ٩; But if color is enabled, the output gets corrupted because ANSI escapes are inserted between the two bytes of the multibyte character on the 2nd line of diagnostics. 6. There is also a problem with formatting the output of some diagnostics, e.g. when I compile this: --------------- int π = 3; int x = π2; --------------- I get: t2.cpp:2:9: error: ‘π2’ was not declared in this scope; did you mean ‘\xcf\x80’? This is also not specific to this patch and occurs the same if UCN is used: $ cat t.cpp int \u03C0 = 3; int x = \u03C02; $ g++ -c t.cpp t.cpp:2:9: error: ‘π2’ was not declared in this scope int x = \u03C02; ^~~~~~~ t.cpp:2:9: note: suggested alternative: ‘\xcf\x80’ int x = \u03C02; ^~~~~~~ 7. What is the expected output from gcc -E of this code? ------- int π; -------- Currently it outputs: int \U000003c0; So curiously, it's as if C++ required translation of extended chars to UCNs is happening, so I think this output is actually potentially correct in C++ mode? But it is also this way in C mode which I think is probably not expected. It seems to come from cpp_output_token() which does not make use of the "original spelling" data structures. I am not sure about this one but probably the right solution is not much work, if someone knows what that might be? This is also the reason that one of the new testcases (gcc/testsuite/gcc.dg/cpp/ucnid-13-utf8.c) fails, this: #define Á 1 also preprocesses (in -E -dD) to include UCNs. I am not sure what is expected here. 8. There are tests (e.g. gcc/testsuite/gcc.dg/ucnid-10.c) which verify that when the locale is not utf8, diagnostics use UCNs instead of raw UTF8. I am not sure if this still makes sense when the files themselves contain UTF8, but that was the behavior that came out so I maintained these tests as well. Thanks for taking a look at this, and for all your work on gcc. I am happy to work on this if someone has any feedback on what I tried so far. -lewis