[Bug other/93067] New: diagnostics are not aware of -finput-charset

lhyatt at gmail dot com Tue, 24 Dec 2019 11:15:58 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067


            Bug ID: 93067
           Summary: diagnostics are not aware of -finput-charset
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: other
          Assignee: unassigned at gcc dot gnu.org
          Reporter: lhyatt at gmail dot com
  Target Milestone: ---

Created attachment 47547
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47547&action=edit
Candidate patch

Hello-

When source lines are needed for diagnostics output, they are retrieved from
the source file by the fcache infrastructure in input.c, since libcpp has
generally forgotten them already (plus not all front ends are using
libcpp). This infrastructure does not read the files in the same way as libcpp
does; in particular, it does not translate the encoding as requested by
-finput-charset, and it does not strip a UTF-8 BOM if present. The attached
patch adds this ability. My thinking in deciding how to do it was the
following:

- Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still, so
this patch should try hard not to introduce any worse performance unless these
features are being used.

- It is desirable to reuse libcpp's encoding infrastructure from charset.c
rather than repeat it in input.c. (Notably, libcpp uses iconv but it also has
hand-coded routines for certain charsets to make sure they are available.)

- Using libcpp directly comes with a performance cost, because the input.c
infrastructure reads only as much of the source file as necessary, whereas the
libcpp interfaces require reading the entire file into memory.

- It can't be quite as simple as just "only delegate to libcpp if
-finput-charset was specified", because the stripping of the UTF-8 BOM has to
happen with or without this option.

- So this seemed a reasonable compromise to me: if -finput-charset is
specified, use libcpp to convert the file; otherwise, strip the BOM in input.c
and then process the file the same way it is done now. There's a little bit of
leakage of charset logic from libcpp this way, but it seems worthwhile, since
otherwise diagnostics would always read the entire file into memory, which is
not a cost paid currently.
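
To make the compromise in the list above concrete, here is a minimal sketch in
C. It is not the code from the attached patch: the names utf8_bom_length,
choose_read_strategy and enum read_strategy are hypothetical and exist only
for illustration. The one fact relied on is that a UTF-8 BOM is the byte
sequence EF BB BF at the start of the file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the attached patch.
   Return the number of leading bytes to skip: 3 if BUF starts with the
   UTF-8 byte order mark (EF BB BF), otherwise 0.  */
static size_t
utf8_bom_length (const unsigned char *buf, size_t len)
{
  static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };

  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}

/* Hypothetical names for the two ways of reading a source file.  */
enum read_strategy
{
  READ_INCREMENTAL,          /* Read only as much as needed; strip a BOM.  */
  READ_WHOLE_FILE_CONVERTED  /* Read the whole file and convert its charset.  */
};

/* Pick the expensive whole-file path only when an input charset was
   requested (INPUT_CHARSET is non-NULL); otherwise keep the cheap path.  */
static enum read_strategy
choose_read_strategy (const char *input_charset)
{
  return input_charset != NULL ? READ_WHOLE_FILE_CONVERTED : READ_INCREMENTAL;
}
```

With this shape, a file read without -finput-charset pays only for one
three-byte comparison on its first buffer.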

Hope that makes some sense. One thing I wasn't sure of is the right way to
communicate to input.c that libcpp is being used to process the source files.
I added a global variable for this because I didn't see any other way to do it
without making drastic changes. This requires the C-family and Fortran front
ends, and any future libcpp users, to set the variable; basically it makes
explicit the fact that there's a single global cpp_reader object in use. If
there's a less-bad way to do it, please let me know... If the global
cpp_reader variable doesn't get set, then the behavior is the same as it is
currently.
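
The global-variable arrangement described above could look roughly like the
following sketch. All names here are hypothetical (struct reader stands in for
cpp_reader, and register_reader and charset_conversion_available are invented
for illustration); the attached patch may well spell this differently.

```c
#include <stddef.h>

/* Hypothetical stand-in for libcpp's cpp_reader.  */
struct reader
{
  const char *input_charset;
};

/* Stays NULL unless a front end that uses the reader registers it.  */
static struct reader *global_reader = NULL;

/* Called once by a front end that processes its sources with the reader.  */
static void
register_reader (struct reader *r)
{
  global_reader = r;
}

/* Nonzero if the diagnostics code may delegate charset conversion to the
   reader; zero means fall back to the existing behavior.  */
static int
charset_conversion_available (void)
{
  return global_reader != NULL && global_reader->input_charset != NULL;
}
```

A front end that never calls register_reader leaves the pointer NULL, so the
diagnostics code takes the same path it takes today.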

Separate from the attached patch are two testcases that both fail before this
patch and pass after. I attached them gzipped because they use non-standard
encodings.

If this overall approach seems OK, please let me know and I can prepare it for
gcc-patches. Bootstrap + regtest look good on x86-64 Linux. Thanks!

-Lewis
