https://gcc.gnu.org/bugzilla/show_bug.cgi?id=61896
Tom Honermann <tom at honermann dot net> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |tom at honermann dot net

--- Comment #1 from Tom Honermann <tom at honermann dot net> ---
Created attachment 38565
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38565&action=edit
Source file with ill-formed UTF-8 code unit sequences

Gcc's incorrect documentation regarding the default input character set
continues to be a source of confusion.  See the discussion on the C++
std-proposals list at the following link (search for 'locale').

https://groups.google.com/a/isocpp.org/forum/#!searchin/std-proposals/Draft$20proposal$20of$20file$20string/std-proposals/tKioR8OUiAw/85NCUmojBwAJ

The current gcc 6.1.0 documentation for -finput-charset can be found here:

https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/Preprocessor-Options.html#Preprocessor-Options

The relevant text is:

  -finput-charset=charset
      Set the input character set, used for translation from the character
      set of the input file to the source character set used by GCC.  If the
      locale does not specify, or GCC cannot get this information from the
      locale, the default is UTF-8.  This can be overridden by either the
      locale or this command-line option.  Currently the command-line option
      takes precedence if there's a conflict.  charset can be any encoding
      supported by the system's iconv library routine.

The patch proposed in attachment 33179 in comment 0 is an improvement in that
it removes the incorrect references to use of the current locale in
determining the input character set.  However, the proposed documentation is
still incorrect, or at least imprecise, with regard to use of UTF-8 as the
default input character set since gcc does not reject (or even emit a warning
for) ill-formed UTF-8 text.  An example follows.

The attached test code (attached to prevent mutation of the contents) contains
ill-formed UTF-8 code unit sequences.
Compilation with gcc 6.1.0 (on a Linux system) succeeds despite the ill-formed
input.

# To demonstrate that the text is ill-formed:
$ iconv -f utf-8 -t utf-8 t.cpp
#include <cstdio>

int main() {
    printf("narrow string: (well-formed UTF-8)\n");
    for (unsigned char c : "£") { // 0xC2 0xA3
        printf(" 0x%X\n", (unsigned int)c);
    }
    printf("narrow string: (ill-formed UTF-8)\n");
    for (unsigned char c : "iconv: illegal input sequence at position 261

$ g++ --version
g++ (GCC) 6.1.0
...

$ g++ -Wall -Wextra -pedantic t.cpp -o t; echo $?
0

$ ./t
narrow string: (well-formed UTF-8)
 0xC2
 0xA3
 0x0
narrow string: (ill-formed UTF-8)
 0xA3
 0x0
narrow string (hex escape):
 0xA3
 0x0
UTF-8 string: (well-formed UTF-8)
 0xC2
 0xA3
 0x0
UTF-8 string: (ill-formed UTF-8)
 0xA3
 0x0
UTF-8 string (hex escape):
 0xA3
 0x0

As shown above, ill-formed code unit sequences are passed through without
being transcoded to the execution character set (I would expect an error or
translation to a replacement character for the ill-formed sequences).

Note that validation is performed if a non-utf-8 execution character set is
specified.

$ g++ -Wall -Wextra -pedantic -fexec-charset=iso8859-1 t.cpp -o t
t.cpp: In function ‘int main()’:
t.cpp:9:28: error: converting to execution character set: Invalid or incomplete multibyte or wide character
     for (unsigned char c : "�") { // 0xA3
                            ^~~

I propose the documentation be updated to reflect this behavior:

  -finput-charset=charset
      Set the input character set, used for translation from the character
      set of the input file to the source character set used by GCC.  The
      default input character set is UTF-8.  charset can be any encoding
      supported by the system's iconv library routine.  If the input
      character set matches the execution character set, then ill-formed
      code unit sequences are passed through without validation or
      translation.  Otherwise, ill-formed code unit sequences will result in
      an error during transcoding to the execution character set.
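
To make concrete what "ill-formed" means for the byte sequences in the test
above, here is a minimal sketch of a UTF-8 well-formedness check based on the
well-formed byte sequence table (Table 3-7) in the Unicode Standard.  This is
only an illustration: the names is_well_formed_utf8 and
check_attachment_examples are hypothetical, and nothing here is taken from
the gcc or iconv sources.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not part of GCC or iconv): returns true if `s` is a
// well-formed UTF-8 code unit sequence per Unicode Standard Table 3-7.
bool is_well_formed_utf8(const std::string& s) {
    const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
    const std::size_t n = s.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b = p[i];
        if (b <= 0x7F) { ++i; continue; }      // single-byte (ASCII)

        std::size_t len = 0;                    // total length of the sequence
        unsigned char lo = 0x80, hi = 0xBF;     // allowed range of the 2nd byte
        if (b >= 0xC2 && b <= 0xDF)      { len = 2; }
        else if (b == 0xE0)              { len = 3; lo = 0xA0; } // no overlongs
        else if (b >= 0xE1 && b <= 0xEC) { len = 3; }
        else if (b == 0xED)              { len = 3; hi = 0x9F; } // no surrogates
        else if (b >= 0xEE && b <= 0xEF) { len = 3; }
        else if (b == 0xF0)              { len = 4; lo = 0x90; } // no overlongs
        else if (b >= 0xF1 && b <= 0xF3) { len = 4; }
        else if (b == 0xF4)              { len = 4; hi = 0x8F; } // <= U+10FFFF
        else return false;  // lone continuation byte, 0xC0/0xC1, or > 0xF4

        if (i + len > n) return false;                    // truncated sequence
        if (p[i + 1] < lo || p[i + 1] > hi) return false; // bad second byte
        for (std::size_t k = 2; k < len; ++k) {
            if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
        }
        i += len;
    }
    return true;
}

// The two narrow string literals from the test case, written as hex escapes.
void check_attachment_examples() {
    assert(is_well_formed_utf8("\xC2\xA3"));  // U+00A3 POUND SIGN
    assert(!is_well_formed_utf8("\xA3"));     // lone continuation byte
}
```

Calling check_attachment_examples() from main exercises both cases: the
0xC2 0xA3 sequence is accepted, and the lone 0xA3 byte is rejected, which is
the byte iconv reports at position 261.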