[Bug preprocessor/49973] Column numbers count multibyte characters as multiple columns

lhyatt at gmail dot com Tue, 17 Sep 2019 15:35:54 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=49973


--- Comment #16 from Lewis Hyatt <lhyatt at gmail dot com> ---
Thank you both for the feedback so far. Regarding the use of wcwidth(), one
thing I noticed is that glibc gives quite different results than gnulib does.
For instance, emojis return width 2 in the former rather than 1 (which seems
better, from what I can tell). It seems that glibc has undergone a fair amount
of tweaking to match what applications expect, so what it provides does not
come purely from parsing the Unicode specs, although that is probably the bulk
of it. I wonder whether this is a sign that it might be better to just make
use of glibc and not add a third implementation to the mix?

In any case, the underlying source of wcwidth() could easily be swapped out as
a drop-in replacement, so I guess that can be decided later. The use of
mbrtowc() is the bigger problem: it converts from the user's locale, whereas
it needs to convert from whatever -finput-charset asked for (or else UTF-8).
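
To make the two points above concrete, here is a minimal standalone sketch (not
taken from the patch) of turning a 1-based byte column into a 1-based display
column. It decodes UTF-8 directly, so the result does not depend on the user's
locale the way mbrtowc() does, and it takes the width function as a parameter
so that the source of widths can be swapped later. All names here (width_fn,
utf8_decode, toy_width, display_column) are hypothetical, and toy_width is a
deliberately tiny stand-in table, not real Unicode width data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type: display width of one code point.  */
typedef int (*width_fn) (uint32_t cp);

/* Decode one UTF-8 sequence from s[0..len).  Stores the code point in *cp
   and returns the number of bytes consumed (always >= 1).  Malformed or
   truncated input is consumed one byte at a time as U+FFFD.  This sketch
   does not reject overlong forms or surrogates.  */
static size_t
utf8_decode (const unsigned char *s, size_t len, uint32_t *cp)
{
  assert (s != NULL && cp != NULL && len > 0);
  unsigned char b = s[0];
  size_t need;   /* number of continuation bytes expected */
  uint32_t v;

  if (b < 0x80) { *cp = b; return 1; }
  else if ((b & 0xE0) == 0xC0) { need = 1; v = b & 0x1F; }
  else if ((b & 0xF0) == 0xE0) { need = 2; v = b & 0x0F; }
  else if ((b & 0xF8) == 0xF0) { need = 3; v = b & 0x07; }
  else { *cp = 0xFFFD; return 1; }

  if (len < need + 1) { *cp = 0xFFFD; return 1; }
  for (size_t i = 1; i <= need; i++)
    {
      if ((s[i] & 0xC0) != 0x80) { *cp = 0xFFFD; return 1; }
      v = (v << 6) | (s[i] & 0x3F);
    }
  *cp = v;
  return need + 1;
}

/* Toy width table (an assumption for illustration only): combining
   diacritics are 0 wide, CJK ideographs and a broad emoji range are 2
   wide, everything else is 1.  */
static int
toy_width (uint32_t cp)
{
  if (cp >= 0x0300 && cp <= 0x036F) return 0;
  if (cp >= 0x4E00 && cp <= 0x9FFF) return 2;
  if (cp >= 0x1F300 && cp <= 0x1FAFF) return 2;
  return 1;
}

/* Return the 1-based display column of the character that starts at
   1-based byte column byte_col in the UTF-8 line line[0..len).  */
static int
display_column (const char *line, size_t len, size_t byte_col, width_fn width)
{
  const unsigned char *s = (const unsigned char *) line;
  size_t i = 0;
  int col = 1;
  while (i < len && i + 1 < byte_col)
    {
      uint32_t cp;
      i += utf8_decode (s + i, len - i, &cp);
      col += width (cp);
    }
  return col;
}
```

With this toy table, the '=' in a line containing a two-byte character before
it lands one display column earlier than its byte column, and a character
following an emoji lands two byte columns earlier. Passing a different
width_fn is all it takes to try another width source.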

I have a more or less fully-baked patch at this point that fixes up all
diagnostics that I am aware of (changes mostly in diagnostic.c and
diagnostic-show-locus.c) to be multi-byte aware. That includes column numbers,
carets, annotations, notes, fixit hints, etc. The patch still ignores the
input-charset issue and uses mbrtowc(), so that is the last thing for me to add
before I think it is worth sharing. Could I get some advice as to where to
start here, please?

It seems that location_get_source_line() in input.c basically needs to return
the lines converted to UTF-8, since all parsing has been working with the lines
in this form, and all the byte offsets that were stored in rich_locations etc.
are relative to the converted data too. I am not sure, though, what the correct
way is for location_get_source_line() to know the value of the -finput-charset
option. Typically this is inspected from a cpp_reader object, but as far as I
understand, none is available in the context where this runs. It seems that in
order to make use of the existing conversion machinery in libcpp/charset.c, I
need to have a cpp_reader instance available too. I would appreciate any
suggestions here. Thanks!
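
As a small illustration of why the offsets matter, here is a standalone sketch
in which ISO-8859-1 stands in for whatever -finput-charset names. The helpers
latin1_to_utf8 and byte_offset_of are hypothetical and hand-rolled for the
example. They are not GCC functions, and the real fix would presumably go
through the conversion code in libcpp/charset.c instead.

```c
#include <assert.h>
#include <stddef.h>

/* Convert ISO-8859-1 bytes in[0..len) to UTF-8.  OUT must have room for
   2 * len + 1 bytes.  Returns the number of bytes written, not counting
   the terminating NUL.  */
static size_t
latin1_to_utf8 (const char *in, size_t len, char *out)
{
  assert (in != NULL && out != NULL);
  size_t o = 0;
  for (size_t i = 0; i < len; i++)
    {
      unsigned char b = (unsigned char) in[i];
      if (b < 0x80)
        out[o++] = (char) b;                    /* ASCII: unchanged */
      else
        {
          out[o++] = (char) (0xC0 | (b >> 6));  /* lead byte */
          out[o++] = (char) (0x80 | (b & 0x3F)); /* continuation byte */
        }
    }
  out[o] = '\0';
  return o;
}

/* Return the 0-based byte offset of the first CH in buf[0..len), or len
   if CH does not occur.  */
static size_t
byte_offset_of (const char *buf, size_t len, char ch)
{
  for (size_t i = 0; i < len; i++)
    if (buf[i] == ch)
      return i;
  return len;
}
```

For a Latin-1 line holding one non-ASCII byte before an '=', the '=' sits one
byte further along in the converted UTF-8 data than in the raw file bytes. An
offset recorded by the parser against the converted data would therefore point
at the wrong byte if it were applied to the unconverted line.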

-Lewis
