Ross Ridge wrote:
Nicolas De Rico wrote:
The file hi-utf16.c, created with Notepad and saved as "Unicode",
contains a BOM, which is, in essence, a small header at the beginning of
the file that indicates the encoding.
It's not a header that indicates the encoding. It's a header that
indicates the byte order of the 16-bit values that follow when the
encoding is already known to be UTF-16. When the encoding is known
to be UTF-16LE or UTF-16BE, there shouldn't be any "BOM" present at the
start of a C file, since a "BOM" in the correct byte order is actually
the Unicode "zero-width no-break space" character, which isn't valid
as the first character in a C file. Similarly, there shouldn't be a
BOM at the start of a UTF-8 C file, especially since UTF-8-encoded
files don't have a byte order.
The presence of what looks to be a UTF-16 BOM can be used as part
of a heuristic to guess the encoding of a file, but I don't think it's a
good idea for GCC to guess the encoding of files.
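As an aside, that BOM-sniffing heuristic is easy to sketch. The following is a minimal illustration of the idea, not anything GCC actually does; `guess_encoding` is a made-up helper that looks only at a file's first three bytes.

```shell
# Sketch of BOM sniffing: guess_encoding is a hypothetical helper that
# inspects only the first three bytes of the named file.
guess_encoding() {
    first=$(od -An -tx1 -N3 "$1" | tr -d ' ')
    case "$first" in
        efbbbf*) echo "UTF-8 (with BOM)" ;;  # ef bb bf
        fffe*)   echo "UTF-16LE" ;;          # ff fe
        feff*)   echo "UTF-16BE" ;;          # fe ff
        *)       echo "unknown" ;;           # no BOM: ASCII, plain UTF-8, ...
    esac
}

# Example: the two-character file "hi" as Notepad saves it in "Unicode",
# i.e. a UTF-16LE BOM (ff fe) followed by UTF-16LE code units.
printf '\377\376h\000i\000' > hi-utf16.c
guess_encoding hi-utf16.c    # prints: UTF-16LE
```

Note the ambiguity the heuristic has to live with: a file with no BOM could be ASCII, UTF-8, Latin-1, or even BOM-less UTF-16, which is exactly why guessing is risky.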
Of course, stdio.h is stored in UTF-8 on the system, so trying to convert
it from UTF-16 will fail right away.
It would probably be more accurate to describe "stdio.h" as an ASCII file.
It's true that stdio.h is ASCII. I wasn't thinking properly; files
saved in UTF-8 or Latin-1 (for example) should compile correctly if the
proper setting is used.
But how can someone use gcc to compile a file created with Visual C++ and
saved as Unicode?
Microsoft puts a BOM at the start of UTF-16 files. It even does so for
UTF-8 files that are saved with Notepad (this can be confirmed using
'od -x'). This allows their programs to detect the encoding
automatically. Note that vim seems to be able to detect the encoding
using the BOM.