As far as I can tell, GCC's diagnostic output on stderr is a mixture of
bytes from various different places in our internal representation:

- filenames
- format strings from diagnostic messages (potentially translated via
  .po files)
- identifiers
- quoted source code
- fix-it hints
- labels
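A single diagnostic typically interleaves several of these.  As a
purely illustrative sketch (hypothetical code, not a quote from the
tree), a call like:

  warning_at (loc, OPT_Wunused_variable,
              "unused variable %qD", decl);

involves a format string (translatable via .po files), an identifier
(%qD ultimately prints bytes from IDENTIFIER_POINTER), and a location
from which quoted source code, fix-it hints and labels can be printed
underneath.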
As noted in https://gcc.gnu.org/onlinedocs/cpp/Character-sets.html,
source files can be in any character set, specified by
-finput-charset=, and libcpp converts that to the "source character
set", Unicode, encoding it internally as UTF-8.  String and character
constants are then converted to the execution character set
(defaulting to UTF-8-encoded Unicode).

In many places we use identifier_to_locale to convert from the
"internal encoding" to the locale character set, falling back to
converting non-ASCII characters to UCNs.  I suspect that there are
numerous places where we're not doing that, but ought to be.

The only test coverage I could find for -finput-charset is
gcc.dg/ucnid-16-utf8.c, which has a latin1-encoded source file and
verifies that a latin1-encoded variable name becomes UTF-8-encoded in
the resulting .s file.  I shudder to imagine a DejaGnu test for a
source encoding that's not a superset of ASCII (e.g. UCS-4): how would
the dg directives be handled?  I wonder if DejaGnu can support tests
in which the compiler's locale is overridden with environment
variables (and thus having e.g. non-ASCII/non-UTF-8 output).

What is the intended encoding of GCC's stderr?  In gcc_init_libintl
we call:

#if defined HAVE_LANGINFO_CODESET
  locale_encoding = nl_langinfo (CODESET);
  if (locale_encoding != NULL
      && (!strcasecmp (locale_encoding, "utf-8")
          || !strcasecmp (locale_encoding, "utf8")))
    locale_utf8 = true;
#endif

so presumably the encoding of stderr ought to be nl_langinfo
(CODESET).  We use the above to decide whether to use the UTF-8
encodings of U+2018 and U+2019 for open/close quotes, falling back to
ASCII quote characters otherwise.

As far as I can tell, we currently:

- blithely accept and emit filenames as bytes (I don't think we make
  any attempt to enforce that they're in any particular encoding)
- emit format strings in whatever encoding gettext gives us
- emit identifiers as char * from IDENTIFIER_POINTER, calling
  identifier_to_locale on them in many places, but I suspect we're
  missing some
- blithely emit quoted source code as raw bytes (this is PR
  other/93067, which has an old patch attached; presumably the source
  ought to be emitted to stderr in the locale encoding)
- fix-it hints can contain identifiers as char * from
  IDENTIFIER_POINTERs, which are likely UTF-8; I think I'm failing to
  call identifier_to_locale on them
- labels can contain type names, which are likely UTF-8, and I'm
  probably failing to call identifier_to_locale on them

So I think our current policy is:

- we assume filenames are encoded in the locale encoding, and pass
  them through as bytes with no encode/decode
- we emit to stderr in the locale encoding (but there are likely bugs
  where we don't re-encode from UTF-8 to the locale encoding)

Does this sound correct?

My motivation here is the discussion in [1] and [2] of supporting
Emacs via an alternative output format for machine-readable fix-it
hints, which has made me realize that I didn't understand our current
approach to encodings as well as I would like.

Hope this is constructive
Dave

[1] https://debbugs.gnu.org/cgi/bugreport.cgi?bug=25987
[2] https://gcc.gnu.org/pipermail/gcc-patches/2020-November/559105.html
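P.S. For concreteness, here's a minimal standalone sketch of the kind
of UTF-8-to-locale re-encoding I have in mind for the cases above
where we currently emit raw UTF-8.  This is not GCC code
(identifier_to_locale has its own implementation, including the UCN
fallback); it's plain POSIX iconv plus the same nl_langinfo (CODESET)
lookup quoted above:

#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>

int
main (void)
{
  /* Pick up the user's locale so that nl_langinfo reflects the
     environment, as in gcc_init_libintl.  */
  setlocale (LC_ALL, "");

  /* The UTF-8 encoding of "naïve" (U+00EF is 0xC3 0xAF).  */
  const char *utf8 = "na\xc3\xafve";

  /* Convert from the internal encoding (UTF-8) to the locale
     encoding.  */
  iconv_t cd = iconv_open (nl_langinfo (CODESET), "UTF-8");
  if (cd == (iconv_t) -1)
    {
      /* No converter available; identifier_to_locale would fall back
         to UCNs at this point.  */
      fprintf (stderr, "%s\n", utf8);
      return 1;
    }

  char outbuf[64];
  char *inp = (char *) utf8;
  char *outp = outbuf;
  size_t inleft = strlen (utf8);
  size_t outleft = sizeof outbuf - 1;
  if (iconv (cd, &inp, &inleft, &outp, &outleft) == (size_t) -1)
    fprintf (stderr, "conversion to %s failed\n",
             nl_langinfo (CODESET));
  else
    {
      *outp = '\0';
      fprintf (stderr, "%s\n", outbuf);
    }
  iconv_close (cd);
  return 0;
}

Run under LANG=en_US.UTF-8 this emits the bytes unchanged; under
e.g. LANG=fr_FR.ISO8859-1 it emits the latin1 byte 0xEF instead.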