[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

lhyatt at gmail dot com Sat, 14 Sep 2019 21:22:06 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973


Lewis Hyatt <lhyatt at gmail dot com> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |lhyatt at gmail dot com

--- Comment #13 from Lewis Hyatt <lhyatt at gmail dot com> ---
Created attachment 46882
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=46882&action=edit
partial patch illustrating the intended approach

Hello-

I would like to help fix this up. I have another patch, hopefully being
approved soon, that will implement support for extended characters in
identifiers. If that sees some use, then this problem will become more
noticeable, so it would be nice to address the column numbering issue at the
same time.

It's worth noting that there is a related problem not specifically mentioned so
far in this bug report -- in addition to the column number, the caret output in
diagnostic-show-locus.c also goes to the same wrong column.

Attached for comment is a preliminary patch (no tests yet) that fixes the
column number only (not the caret). If the overall approach seems sound, I am
happy to flesh it out to handle all the fancy diagnostics too. I have a couple
questions about the implementation, though, which I thought worthwhile to clear
up before proceeding further.

Here is a useful test case for this:

=======
const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
=======

Currently it outputs:

t.cpp:1:65: error: ‘y’ was not declared in this scope
 const char* smile = "😊😊😊😊😊😊😊😊"; int x = y;
                                                                 ^

And indeed the column number and caret are both incorrect (but consistent with
each other). Not visible here is that the 'y' is properly colorized in the
source line. That works as-is because the ANSI color escapes should be inserted
at the proper byte, not the proper display column. So the machinery in
diagnostic-show-locus.c requires both the byte count and the logical column
count in order to both colorize the source line and put the caret at the right
place. 

Therefore the approach I chose was to keep most things exactly as they are --
all locations are tracked via byte counts just as now. Then all that's
necessary is to compute the logical display column just when it's needed, at
the time the diagnostic is output. This seems tenable because
location_get_source_line() already lets us retrieve the line and work with it
efficiently. That's the idea I went with here to adjust the column number;
happy to hear any feedback before diving in to applying the same idea
throughout diagnostic-show-locus.c.

I have one other question though. This quick attempt uses wchar.h, namely the
mbrtowc() and wcwidth() functions. Firstly, it seems unfortunate to introduce a
dependency on those, which may be problematic for Windows, etc. (I borrowed the
same configure checks used in intl.c for that.) But also, this will always
convert multibyte input files using the user's locale, whereas GCC doesn't work
like this when lexing files -- it ignores the locale and rather defaults to
UTF-8 unless overridden by -finput-charset. It seems that in a perfect world
diagnostics parsing of the source should work the same way.

I feel like the most portable solution is just to use directly the necessary
code (from glibc or gnulib or from scratch or wherever) to handle the wcwidth()
functionality, and tweak it for this purpose. It's in essence just a binary
search in a table. Basically I would convert the source line from the input
charset to UTF-8 the same way the file is read on original input (using the
facilities in libcpp/charset.c), and then I would just need a variant of
wcwidth() that, rather than dealing with wchar_t and locales, simply takes
UTF-8 input and returns the necessary width. Does this seem like a good
approach? Otherwise, if I keep the use of wchar APIs, I think it would be
necessary to make a temporary change to the locale when computing the display
column, as dictated by the input charset.

Note that this also brings up an unrelated question with -finput-charset. Say
right now, I have a file in latin1 encoding and it is processed with
-finput-charset=latin1, and then I compile it from a UTF-8 locale. If a source
line is output in diagnostics, currently it gets output in the latin1 encoding
and produces unreadable characters on the terminal. With changes along the
lines I propose in the previous paragraph, the diagnostics would end up being
output in UTF-8 (or whatever the current locale demands), which I think is
maybe an improvement as well.

Thanks for any suggestions.

-Lewis

[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

Reply via email to