Hi Bruno,

This is interesting information. In recent months I have been working on a 
Windows-specific implementation of the POSIX newlocale/uselocale functions, as 
well as replacements for locale-related CRT functions.

I have collected some information about locale-related features in the Windows 
API and the CRT. I posted it to the mingw-w64 list; you can find it here if 
you're interested[1].

For example, are you aware that CRTs starting with msvcr80.dll (in particular 
UCRT) natively support thread locales? I was wondering whether libintl/libiconv 
take this into account, because if not, this may lead to surprises.

- Kirill Makurin

[1] https://sourceforge.net/p/mingw-w64/mailman/message/59198335/
________________________________
From: [email protected] 
<[email protected]> on behalf of Bruno Haible 
via Gnulib discussion list <[email protected]>
Sent: Wednesday, September 17, 2025 12:45 AM
To: [email protected] <[email protected]>
Cc: Michele Locati <[email protected]>; Eli Zaretskii <[email protected]>
Subject: Document msvcrt (native Windows) bugs regarding console output

The stdio output functions have two bugs when it comes to output
to a Windows console.

Windows consoles come with two encodings: GetACP() and GetOEMCP(). For
Japanese, both have the same value (932). However, for English, German,
French Windows installations, GetACP() = 1252 and GetOEMCP() = 850.
For many years, output of non-ASCII characters to consoles was a PITA:
While the program had to produce output in GetACP() encoding when
writing to files, it had to produce output in GetOEMCP() encoding when
writing to a console. The majority of programs did not do this: they
produced output in GetACP() encoding always, and thus non-ASCII
characters got garbled in consoles.

After many, many years, Microsoft finally added a workaround in the
C runtime library (msvcrt and ucrt). When a program writes a string
to a console, the runtime library tests whether the output goes
to a console, and if yes, it does a conversion from GetACP() encoding
to GetOEMCP() encoding on the fly, in two steps: from GetACP() to UTF-16
via MultiByteToWideChar, then to GetOEMCP() via WideCharToMultiByte.

This workaround works fine in ucrt. But in msvcrt this workaround
has two bugs. Both happen when
  - The output goes to a console. (No bug when the output goes to a file.)
and
  - The stream's mode is _O_TEXT. (Which is the default for stdout
    and stderr. No bug when the stream's mode is _O_BINARY.)
and
  - setlocale() has been called beforehand. (No bug if setlocale() has not
    been called, that is, when the locale remains the "C" locale.)
and
  - The chosen locale has a double-byte encoding, such as CP932.
    (No bug for unibyte locale encodings, such as CP1252.)
and
  - The console's codepage matches the locale's encoding. For
    example, after 'chcp 932' was executed.

Bug 1:

When the application outputs double-byte characters one byte at
a time, using the functions fputc() or putc(), the console shows JISX0201
(ASCII and Katakana) characters instead of CP932 (ASCII, Katakana,
Hiragana, Hanzi) characters.

How to reproduce:
1. Use Windows 10 or 11. Switch it to Japanese as main language.
2. Use the attached program. In the dev environment:
  $ gcc -Wall foo.c
3. In a cmd.exe console:
  $ chcp 932
  $ .\a
Look at the output of the parts C and D.

Bug 2:

When the application uses fwrite() to output a string that starts with a
non-ASCII character, the console shows no output, and the stream's error
indicator gets set.

How to reproduce:
1. Use Windows 10 or 11. Switch it to Japanese as main language.
2. Use the attached program. In the dev environment:
  $ gcc -Wall foo.c
3. In a cmd.exe console:
  $ chcp 932
  $ .\a
Look at the output of the parts E and F.

I don't plan to add workarounds for these bugs to Gnulib, because
* Normal applications don't write strings one byte at a time, for
  speed.
* Normal applications use fwrite() for binary I/O and fputs() or
  [v][f]printf or similar for text I/O.

If anyone wants these bugs fixed, they will have to build their
application against ucrt instead of msvcrt. The MSYS2 project
contains tools and libraries for mingw+ucrt. (Btw, building with
ucrt instead of msvcrt also has the benefit of supporting the
UTF-8 locales of Windows. [1][2])

[1] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale
    "Starting in Windows 10 version 1803 (10.0.17134.0), the Universal C
     Runtime supports using a UTF-8 code page."
[2] https://lists.gnu.org/archive/html/bug-gnulib/2024-12/msg00159.html


2025-09-16  Bruno Haible  <[email protected]>

        Document msvcrt (native Windows) bugs regarding console output.
        * doc/posix-functions/fputc.texi: Document a bug found in msvcrt.
        * doc/posix-functions/putc.texi: Likewise.
        * doc/posix-functions/fwrite.texi: Document another bug found in msvcrt.
