https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973
Lewis Hyatt <lhyatt at gmail dot com> changed: What |Removed |Added ---------------------------------------------------------------------------- CC| |lhyatt at gmail dot com --- Comment #13 from Lewis Hyatt <lhyatt at gmail dot com> --- Created attachment 46882 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46882&action=edit partial patch illustrating the intended approach Hello- I would like to help fix this up. I have another patch, hopefully being approved soon, that will implement support for extended characters in identifiers. If that sees some use, then this problem will become more noticeable, so it would be nice to address the column numbering issue at the same time. It's worth noting that there is a related problem not specifically mentioned so far in this bug report -- in addition to the column number, the caret output in diagnostic-show-locus.c also goes to the same wrong column. Attached for comment is a preliminary patch (no tests yet) that fixes the column number only (not the caret). If the overall approach seems sound, I am happy to flesh it out to handle all the fancy diagnostics too. I have a couple questions about the implementation, though, which I thought worthwhile to clear up before proceeding further. Here is a useful test case for this: ======= const char* smile = "😊😊😊😊😊😊😊😊"; int x = y; ======= Currently it outputs: t.cpp:1:65: error: ‘y’ was not declared in this scope const char* smile = "😊😊😊😊😊😊😊😊"; int x = y; ^ And indeed the column number and caret are both incorrect (but consistent with each other). Not visible here is that the 'y' is properly colorized in the source line. That works as-is because the ANSI color escapes should be inserted at the proper byte, not the proper display column. So the machinery in diagnostic-show-locus.c requires both the byte count and the logical column count in order to both colorize the source line and put the caret at the right place. Therefore the approach I chose was to keep most things exactly as they are -- all locations are tracked via byte counts just as now. Then all that's necessary is to compute the logical display column just when it's needed, at the time the diagnostic is output. This seems tenable because location_get_source_line() already lets us retrieve the line and work with it efficiently. That's the idea I went with here to adjust the column number; happy to hear any feedback before diving in to applying the same idea throughout diagnostic-show-locus.c. I have one other question though. This quick attempt uses wchar.h, namely the mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to introduce a dependency on those, which may be problematic for Windows, etc. (I borrowed the same configure checks used in intl.c for that.) But also, this will always convert multibyte input files using the user's locale, whereas GCC doesn't work like this when lexing files -- it ignores the locale and rather defaults to UTF-8 unless overridden by -finput-charset. It seems that in a perfect world diagnostics parsing of the source should work the same way. I feel like the most portable solution is just to use directly the necessary code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth() functionality, and tweak it for this purpose. It's in essence just a binary search in a table. Basically I would convert the source line from the input charset to UTF-8 the same way the file is read on original input (using the facilities in libcpp/charset.c), and then I would just need a variant of wcwidth() that, rather than dealing with wchar_t and locales, simply takes UTF-8 input and returns the necessary width. Does this seem like a good approach? Otherwise, if I keep the use of wchar APIs, I think it would be necessary to make a temporary change to the locale when computing the display column, as dictated by the input charset. Note that this also brings up an unrelated question with -finput-charset. Say right now, I have a file in latin1 encoding and it is processed with -finput-charset=latin1, and then I compile it from a UTF-8 locale. If a source line is output in diagnostics, currently it gets output in the latin1 encoding and produces unreadable characters on the terminal. With changes along the lines I propose in the previous paragraph, the diagnostics would end up being output in UTF-8 (or whatever the current locale demands), which I think is maybe an improvement as well. Thanks for any suggestions. -Lewis