Hello David- I would appreciate hearing your thoughts please on the following (relatively minor) issue... There is a little bug with colorization in diagnostics: in case a location only points to the first byte of a multibyte sequence, the colorization produces corrupted output as it interrupts the UTF-8 sequence. This seems to happen with errors that come out of cpplib for instance. A test case is:
------- int ٩x; ------- compiled with -x c -std=c99. The fix for this issue is essentially two lines plus a test + comments, and I had originally submitted it here: https://gcc.gnu.org/legacy-ml/gcc-patches/2019-12/msg00915.html Subsequent to that, I also submitted patches to implement tab expansion in diagnostics. Those patches naturally fixed the colorization issue already in a different way, because they rewrote this whole function, so I indicated we might as well forget the simpler older patch. But then we decided to hold off on the tab expansion feature until GCC 11. That makes sense for sure, but then I think regarding the two-line fix for the colorization bug, there is still potentially a good case to push it now? It is not strictly a regression, but because we added support for UTF-8 identifiers in GCC 10, it is going to be more apparent now than it was before. Given that and the fact that it's a pretty small change, would it make sense to go ahead and get this in? For reference I attached the patch and ChangeLog again here as well. Bootstrap all languages on x86-64 linux looks good with all reg tests the same before + after: FAIL 94 94 PASS 473413 473413 UNSUPPORTED 11503 11503 UNTESTED 195 195 XFAIL 1818 1818 XPASS 36 36 Thanks! -Lewis
gcc/ChangeLog: 2020-04-15 Lewis Hyatt <lhy...@gmail.com> * diagnostic-show-locus.c (layout::print_source_line): Do not emit colorization codes in the middle of a UTF-8 sequence. (test_one_liner_colorized_utf8): New test. (test_diagnostic_show_locus_one_liner_utf8): Call the new test.
commit 5e2f6ec837fdf808ba20096bd632be9025f7526c Author: Lewis Hyatt <lhy...@gmail.com> Date: Wed Apr 15 11:17:33 2020 -0400 diagnostics: Fix colorization interrupting UTF-8 sequences diff --git a/gcc/diagnostic-show-locus.c b/gcc/diagnostic-show-locus.c index 4618b4edb7d..0aa42e236ee 100644 --- a/gcc/diagnostic-show-locus.c +++ b/gcc/diagnostic-show-locus.c @@ -1494,6 +1494,8 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, int last_non_ws = 0; for (int col_byte = 1 + x_offset_bytes; col_byte <= line_bytes; col_byte++) { + char c = *line; + /* Assuming colorization is enabled for the caret and underline characters, we may also colorize the associated characters within the source line. @@ -1505,8 +1507,13 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, For frontends that only generate carets, we don't colorize the characters above them, since this would look strange (e.g. - colorizing just the first character in a token). */ - if (m_colorize_source_p) + colorizing just the first character in a token). + + We need to avoid inserting color codes ahead of UTF-8 continuation + bytes to avoid corrupting the output, in case the location range + points only to the first byte of a multibyte sequence. */ + + if (m_colorize_source_p && (((unsigned int) c) & 0xC0) != 0x80) { bool in_range_p; point_state state; @@ -1519,7 +1526,6 @@ layout::print_source_line (linenum_type row, const char *line, int line_bytes, else m_colorizer.set_normal_text (); } - char c = *line; if (c == '\0' || c == '\t' || c == '\r') c = ' '; if (c != ' ') @@ -3854,6 +3860,27 @@ test_one_liner_labels_utf8 () } } +/* Make sure that colorization codes don't interrupt a multibyte + sequence, which would corrupt it. */ +static void +test_one_liner_colorized_utf8 () +{ + test_diagnostic_context dc; + dc.colorize_source_p = true; + diagnostic_color_init (&dc, DIAGNOSTICS_COLOR_YES); + const location_t pi = linemap_position_for_column (line_table, 12); + rich_location richloc (line_table, pi); + diagnostic_show_locus (&dc, &richloc, DK_ERROR); + + /* In order to avoid having the test depend on exactly how the colorization + was effected, just confirm there are two pi characters in the output. */ + const char *result = pp_formatted_text (dc.printer); + const char *null_term = result + strlen (result); + const char *first_pi = strstr (result, "\xcf\x80"); + ASSERT_TRUE (first_pi && first_pi <= null_term - 2); + ASSERT_STR_CONTAINS (first_pi + 2, "\xcf\x80"); +} + /* Run the various one-liner tests. */ static void @@ -3900,6 +3927,7 @@ test_diagnostic_show_locus_one_liner_utf8 (const line_table_case &case_) test_one_liner_many_fixits_1_utf8 (); test_one_liner_many_fixits_2_utf8 (); test_one_liner_labels_utf8 (); + test_one_liner_colorized_utf8 (); } /* Verify that gcc_rich_location::add_location_if_nearby works. */