Hello- In the original discussion of implementing UTF-8 identifiers ( https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67224#c26 ), I pointed out that colorization would corrupt the appearance of certain diagnostics. For example, this code, with -std=c99:
---------- int ٩x; ---------- Produces: t2.cpp:1:5: error: extended character ٩ is not valid at the start of an identifier 1 | int ٩x; | ^ The diagnostic location contains only the first byte of the character, so when colorization is enabled, the ANSI escapes are inserted in the middle of the UTF-8 sequence and produce corrupted output on the terminal. I feel like there are two separate issues here: #1. diagnostic_show_locus() should be sure it will not corrupt output in this way, regardless of what ranges it is given to work with. #2. libcpp should probably generate a range that includes the whole UTF-8 character. Actually in other ways the range seems not ideal, for example if an invalid character appears in the middle of the identifier, the diagnostic still points to the first byte of the identifier. The attached patch fixes #1. It's essentially a one-line change, plus a new selftest. Would you please have a look at it sometime? bootstrap and testsuite were done on linux x86-64. Other questions that I have: - I am not quite clear when a selftest is preferred vs a dejagnu test. In this case I stuck with the selftest because color diagnostics don't seem to work well with dg-error etc, and it didn't seem worth creating a new plugin-based test like g++.dg/plugin just for this. (I also considered using the existing g++.dg plugin, but it seems this test should run for gcc as well.) - I wasn't sure if I should create a PR for an issue such as this, if there is already a patch readily available. And if I did create a PR, not sure if it's preferred to post the patch to gcc-patches, or as an attachment to the PR. - Does it seem worth me looking into #2? I think the patch to address #1 is appropriate in any case, because it handles generically all potential cases where this may arise, but still perhaps the ranges coming out of libcpp could be improved? Thanks... -Lewis
gcc/ChangeLog: 2019-12-12 Lewis Hyatt <lhy...@gmail.com> * diagnostic-show-locus.c (layout::print_source_line): Do not emit color codes in the middle of a UTF-8 sequence. (test_one_liner_colorized_utf8): New test. (test_diagnostic_show_locus_one_liner_utf8): Call the new test.
diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c index c87603caf41..7385b8a1692 100644 --- a/gcc/diagnostic-show-locus.c +++ b/gcc/diagnostic-show-locus.c @@ -1482,6 +1482,8 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, int last_non_ws = 0; for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++) { + char c = *line; + /* Assuming colorization is enabled for the caret and underline characters, we may also colorize the associated characters within the source line. @@ -1493,8 +1495,13 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, For frontends that only generate carets, we don't colorize the characters above them, since this would look strange (e.g. - colorizing just the first character in a token). */ - if (m_colorize_source_p) + colorizing just the first character in a token). + + We need to avoid inserting color codes ahead of UTF-8 continuation + bytes to avoid corrupting the output, in case the location range + points only to the first byte of a multibyte sequence. */ + + if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80) { bool in_range_p; point_state state; @@ -1507,7 +1514,6 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, else m_colorizer.set_normal_text (); } - char c = *line; if (c == '\0' || c == '\t' || c == '\r') c = ' '; if (c != ' ') @@ -3836,6 +3842,27 @@ test_one_liner_labels_utf8 () } } +/* Make sure that colorization codes don't interrupt a multibyte + sequence, which would corrupt it. */ +static void +test_one_liner_colorized_utf8 () +{ + test_diagnostic_context dc; + dc.colorize_source_p = true; + diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES); + const location_t pi = linemap_position_for_column (line_table, 12); + rich_location richloc (line_table, pi); + diagnostic_show_locus (&dc, &richloc, DK_ERROR); + + /* In order to avoid having the test depend on exactly how the colorization + was effected, just confirm there are two pi characters in the output. */ + const char *result = pp_formatted_text (dc.printer); + const char *null_term = result + strlen (result); + const char *first_pi = strstr (result, "\xcf\x80"); + ASSERT_TRUE (first_pi && first_pi <= null_term - 2); + ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80"); +} + /* Run the various one-liner tests. */ static void @@ -3882,6 +3909,7 @@ test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_) test_one_liner_many_fixits_1_utf8 (); test_one_liner_many_fixits_2_utf8 (); test_one_liner_labels_utf8 (); + test_one_liner_colorized_utf8 (); } /* Verify that gcc_rich_location::add_location_if_nearby works. */