As far as I can tell, GCC's diagnostic output on stderr is a mixture of
bytes coming from various places in our internal representation:
- filenames
- format strings from diagnostic messages (potentially translated via
.po files)
- identifiers
- quoted source code
- fix-it hints
- labels

As noted in https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html
source files can be in any character set, specified by -finput-charset=, and 
libcpp converts that to the "source character set", Unicode, encoding it 
internally as UTF-8.  String and character constants are then converted to the 
execution character set (defaulting to UTF-8-encoded Unicode).  In many places 
we use identifier_to_locale to convert from the "internal encoding" to the 
locale character set, falling back to escaping non-ASCII characters as UCNs 
when they can't be represented in that character set.  
I suspect that there are numerous places where we're not doing that, but ought 
to be.
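
As an aside, here's a minimal standalone sketch of that UCN-escaping
fallback (just an illustration of the idea, not GCC's actual
identifier_to_locale implementation; it assumes valid UTF-8 input and
that \uXXXX/\UXXXXXXXX is the desired escape form):

#include <stdio.h>

/* Decode one UTF-8 sequence starting at S, storing the code point in
   *CP and returning the number of bytes consumed.  */
static int
decode_utf8 (const unsigned char *s, unsigned int *cp)
{
  if (s[0] < 0x80)
    {
      *cp = s[0];
      return 1;
    }
  else if ((s[0] & 0xe0) == 0xc0)
    {
      *cp = ((s[0] & 0x1f) << 6) | (s[1] & 0x3f);
      return 2;
    }
  else if ((s[0] & 0xf0) == 0xe0)
    {
      *cp = ((s[0] & 0x0f) << 12) | ((s[1] & 0x3f) << 6) | (s[2] & 0x3f);
      return 3;
    }
  else
    {
      *cp = ((s[0] & 0x07) << 18) | ((s[1] & 0x3f) << 12)
            | ((s[2] & 0x3f) << 6) | (s[3] & 0x3f);
      return 4;
    }
}

/* Print IDENT (UTF-8), escaping non-ASCII code points as UCNs.  */
static void
print_with_ucn_fallback (const char *ident)
{
  const unsigned char *p = (const unsigned char *) ident;
  while (*p)
    {
      unsigned int cp;
      p += decode_utf8 (p, &cp);
      if (cp < 0x80)
        putchar (cp);
      else if (cp <= 0xffff)
        printf ("\\u%04x", cp);
      else
        printf ("\\U%08x", cp);
    }
}

int
main (void)
{
  print_with_ucn_fallback ("caf\xc3\xa9");  /* prints "caf\u00e9" */
  putchar ('\n');
  return 0;
}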

The only test coverage I could find for -finput-charset is
gcc.dg/ucnid-16-utf8.c, which has a latin-1-encoded source file, and
verifies that a latin-1-encoded variable name becomes UTF-8-encoded in
the resulting .s file.  I shudder to imagine a DejaGnu test for a
source encoding that's not a superset of ASCII (e.g. UCS-4) - how would
the dg directives be handled?  I wonder if DejaGnu can support tests in
which the compiler's locale is overridden with environment variables
(and which thus produce e.g. non-ASCII/non-UTF-8 output).
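
(For reference, that kind of test looks roughly like the sketch below.
This is a hypothetical example, not the real ucnid-16-utf8.c; in the
real test the identifier is a raw latin-1 byte rather than the UCN
written here, and the final directive scans for the UTF-8-encoded name
in the .s file.  Note that all the dg directives are plain ASCII,
which is exactly what makes a non-ASCII-superset source encoding
awkward.)

/* { dg-do compile } */
/* { dg-options "-std=c99 -finput-charset=latin1" } */

int caf\u00e9 = 1;

/* { dg-final { scan-assembler "caf" } } */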

What is the intended encoding of GCC's stderr?

In gcc_init_libintl we call:

#if defined HAVE_LANGINFO_CODESET
  locale_encoding = nl_langinfo (CODESET);
  if (locale_encoding != NULL
      && (!strcasecmp (locale_encoding, "utf-8")
          || !strcasecmp (locale_encoding, "utf8")))
    locale_utf8 = true;
#endif

so presumably stderr output ought to be in the encoding given by
nl_langinfo (CODESET).
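
(For anyone wanting to poke at this, the locale's codeset can be
queried with a trivial standalone program along these lines; note that
nl_langinfo only reports the locale's codeset meaningfully after
setlocale has been called:)

#include <langinfo.h>
#include <locale.h>
#include <stdio.h>

int
main (void)
{
  /* Honor LANG/LC_* from the environment.  */
  setlocale (LC_ALL, "");
  printf ("locale codeset: %s\n", nl_langinfo (CODESET));
  /* e.g. "UTF-8" under en_US.UTF-8, "ISO-8859-1" under a latin-1
     locale, "ANSI_X3.4-1968" under the "C" locale on glibc.  */
  return 0;
}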

We use the locale_utf8 flag computed above to decide whether to emit the
UTF-8 encodings of U+2018 and U+2019 for open/close quotes, falling back
to ASCII quote characters otherwise.
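
(Conceptually that's something like the following sketch; the real
logic lives in gcc_init_libintl and the diagnostic pretty-printer, and
the function name and structure here are made up for illustration:)

#include <stdbool.h>
#include <stdio.h>

static const char *open_quote = "'";
static const char *close_quote = "'";

/* If the locale is UTF-8, use the UTF-8 encodings of U+2018/U+2019;
   otherwise stay with plain ASCII quotes.  */
static void
choose_quote_chars (bool locale_utf8)
{
  if (locale_utf8)
    {
      open_quote = "\xe2\x80\x98";   /* U+2018 LEFT SINGLE QUOTATION MARK */
      close_quote = "\xe2\x80\x99";  /* U+2019 RIGHT SINGLE QUOTATION MARK */
    }
}

int
main (void)
{
  choose_quote_chars (true);
  fprintf (stderr, "error: %sfoo%s undeclared\n", open_quote, close_quote);
  return 0;
}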

As far as I can tell, we currently:
- blithely accept and emit filenames as bytes (I don't think we make
any attempt to enforce that they're in any particular encoding)
- emit format strings in whatever encoding gettext gives us
- emit identifiers as char * from IDENTIFIER_POINTER, calling
identifier_to_locale on them in many places, but I suspect we're
missing some
- blithely emit quoted source code as raw bytes (this is PR
other/93067, which has an old patch attached; presumably the source
ought to be emitted to stderr in the locale encoding)
- fix-it hints can contain identifiers as char * from
IDENTIFIER_POINTERs, which are likely UTF-8; I think I'm failing to call
identifier_to_locale on them
- labels can contain type names, which are likely UTF-8, and I'm
probably failing to call identifier_to_locale on them

So I think our current policy is:
- we assume filenames are encoded in the locale encoding, and pass them
through as bytes with no encode/decode
- we emit to stderr in the locale encoding (but there are likely bugs
where we don't re-encode from UTF-8 to the locale encoding)

Does this sound correct?

My motivation here is the discussion in [1] and [2] of supporting Emacs
via an alternative output format for machine-readable fix-it hints,
which has made me realize that I didn't understand our current approach
to encodings as well as I would like.

Hope this is constructive
Dave

[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html
