https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93067
Bug ID: 93067
Summary: diagnostics are not aware of -finput-charset
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: other
Assignee: unassigned at gcc dot gnu.org
Reporter: lhyatt at gmail dot com
Target Milestone: ---

Created attachment 47547
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=47547&action=edit
Candidate patch

Hello-

When source lines are needed for diagnostics output, they are retrieved from the source file by the fcache infrastructure in input.c, since libcpp has generally forgotten them by then (plus not all front ends use libcpp). This infrastructure does not read the files the same way libcpp does; in particular, it does not translate the encoding as requested by -finput-charset, and it does not strip a UTF-8 BOM if present. The attached patch adds this ability. My thinking in deciding how to do it was the following:

- Use of -finput-charset is rare, and use of UTF-8 BOMs must be rarer still, so this patch should try hard not to make performance any worse unless these features are being used.

- It is desirable to reuse libcpp's encoding infrastructure from charset.c rather than repeat it in input.c. (Notably, libcpp uses iconv, but it also has hand-coded routines for certain charsets to make sure they are available.)

- Using libcpp directly comes with a performance cost, because the input.c infrastructure reads only as much of the source file as necessary, whereas the libcpp interfaces require reading the entire file into memory.

- It can't be quite as simple as "only delegate to libcpp if -finput-charset was specified", because the UTF-8 BOM has to be stripped with or without this option.

- So the following seemed a reasonable compromise to me: if -finput-charset is specified, use libcpp to convert the file; otherwise, strip the BOM in input.c and then process the file the same way it is done now.
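To make the compromise in the last bullet concrete, here is a minimal sketch in C. It is not taken from the attached patch: the names utf8_bom_length, read_strategy and choose_read_strategy are hypothetical, and the have_converter flag merely stands in for "a libcpp reader is available to do the conversion".

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not GCC's actual code: return the number of
   leading bytes of BUF that form a UTF-8 byte order mark (3 or 0).  */
static size_t
utf8_bom_length (const char *buf, size_t len)
{
  static const unsigned char bom[] = { 0xEF, 0xBB, 0xBF };

  if (len >= sizeof bom && memcmp (buf, bom, sizeof bom) == 0)
    return sizeof bom;
  return 0;
}

/* Hypothetical names for the two ways of reading a source file.  */
enum read_strategy
{
  READ_INCREMENTAL,   /* Read only as much as needed; strip the BOM locally.  */
  READ_WHOLE_CONVERT  /* Read the entire file and convert its encoding.  */
};

/* Pick a strategy.  INPUT_CHARSET is NULL when -finput-charset was not
   given.  HAVE_CONVERTER is an assumption of this sketch: it is false
   when nothing is available to perform the conversion, in which case
   the cheap incremental path is kept.  */
static enum read_strategy
choose_read_strategy (const char *input_charset, bool have_converter)
{
  if (input_charset != NULL && have_converter)
    return READ_WHOLE_CONVERT;
  return READ_INCREMENTAL;
}
```

On the incremental path, a caller would skip utf8_bom_length (buf, len) bytes once, at the start of the first chunk it reads, and then carry on as before. The whole-file path is only taken when a charset was actually requested, so the common case pays nothing extra beyond a three-byte comparison.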
There's a little bit of leakage of charset logic from libcpp this way, but it seems worthwhile, since otherwise diagnostics would always read the entire file into memory, which is not a cost paid currently. Hope that makes some sense.

One thing I wasn't sure of is the right way to communicate to input.c that libcpp is being used to process the source files. I added a global variable for this because I didn't see any other way to do it without making drastic changes. This requires the C-family and Fortran front ends, and any future libcpp users, to set the variable; basically it makes explicit the fact that there's a single global cpp_reader object in use. If there's a less-bad way to do it, please let me know... If the global cpp_reader variable doesn't get set, then the behavior is the same as it is currently.

Separate from the attached patch are two testcases that both fail before this patch and pass after. I attached them gzipped because they use non-standard encodings.

If this overall approach seems OK, please let me know and I can prepare it for gcc-patches. Bootstrap + reg test look good on x86-64 linux.

Thanks!

-Lewis