Hello,

In my last message in the `Inconsistent behavior of btowc with "C" locale` 
thread, I mentioned that I would like to start a new thread to discuss the CRT 
conversion functions mb*towc* and wc*tomb*, and probably other functions which 
have to deal with locales.

While the CRT's locale functions *work*, the CRT implements many questionable 
and problematic behaviors. Things only get worse when we consider their 
interaction with Win32 APIs.

Recently I started working on a project of mine which implements POSIX APIs on 
top of Win32 and CRT.  I started it by implementing *locale* module and I am 
still working on it.

In this message, I want to summarize what I have learned and discovered about 
Win32 and CRT locales. This message could be used as a reference in the future.

1. Win32

1.1. LCID

Prior to Windows Vista, Windows locales were represented using LCID objects. 
They are just DWORDs which are constructed from LANG_* and SUBLANG_* constants 
using macros like MAKELCID[1].
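
To make the bit layout concrete, here is a minimal sketch of how such a DWORD 
is put together. The ex_* names are hypothetical; they only mirror what the 
MAKELANGID and MAKELCID macros from winnt.h compute, so that the example also 
compiles without the Windows headers. The numeric values are those of 
LANG_ENGLISH, LANG_JAPANESE, SUBLANG_ENGLISH_US, SUBLANG_JAPANESE_JAPAN and 
SORT_DEFAULT.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the winnt.h constants (same values). */
enum {
    EX_LANG_ENGLISH           = 0x09, /* LANG_ENGLISH */
    EX_LANG_JAPANESE          = 0x11, /* LANG_JAPANESE */
    EX_SUBLANG_ENGLISH_US     = 0x01, /* SUBLANG_ENGLISH_US */
    EX_SUBLANG_JAPANESE_JAPAN = 0x01, /* SUBLANG_JAPANESE_JAPAN */
    EX_SORT_DEFAULT           = 0x0   /* SORT_DEFAULT */
};

/* Language ID: primary language in bits 0-9, sublanguage in bits 10-15. */
uint16_t ex_make_langid(unsigned primary, unsigned sub)
{
    assert(primary < 0x400 && sub < 0x40);
    return (uint16_t)((sub << 10) | primary);
}

/* LCID: language ID in bits 0-15, sort ID in bits 16-19. */
uint32_t ex_make_lcid(uint16_t langid, unsigned sort_id)
{
    assert(sort_id < 0x10);
    return ((uint32_t)sort_id << 16) | langid;
}
```

With these definitions, English (United States) comes out as 0x0409 and 
Japanese (Japan) as 0x0411.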

The LCID mechanism has severe limitations and inconsistencies. There are large 
comments in Microsoft's header files addressing them.

Today, LCID locales should only be of interest if you need to support ancient 
versions of Windows.

1.2. Locale Names

Windows Vista introduced new APIs which use locale names instead of LCID 
objects. Windows locale names have the following format:

    ll[-ssss][-cc][_xxxxxx]

where:

- ll is an ISO 639 language code
- ssss is a language script (e.g. Latn for Latin or Cyrl for Cyrillic)
- cc is an ISO 3166 country code
- xxxxxx is a sort order identifier[2]

The Windows ll-cc format is similar to the common ll_CC format used by 
setlocale(3) on Unix/Linux.
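
As an illustration of the format above, here is a rough sketch of a parser 
that splits a locale name into its parts. The struct and function names are 
hypothetical, and the sketch only handles the ll[-ssss][-cc][_xxxxxx] shape 
described here (a four-character script and a two-character country); real 
locale names can be more varied than that.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical container for the parts of ll[-ssss][-cc][_xxxxxx]. */
struct locale_name_parts {
    char language[9];
    char script[5];
    char country[4];
    char sort[7];
};

/* Copy a non-empty part of `len` characters if it fits into `dst`. */
static int copy_part(char *dst, size_t cap, const char *src, size_t len)
{
    if (len == 0 || len >= cap)
        return 0;
    memcpy(dst, src, len);
    dst[len] = '\0';
    return 1;
}

/* Returns 1 on success, 0 if `name` does not match the expected shape. */
int parse_locale_name(const char *name, struct locale_name_parts *out)
{
    const char *underscore, *end, *p;
    int first = 1;

    assert(name != NULL && out != NULL);
    memset(out, 0, sizeof *out);

    /* Everything after '_' is the sort order identifier. */
    underscore = strchr(name, '_');
    end = underscore ? underscore : name + strlen(name);
    if (underscore && !copy_part(out->sort, sizeof out->sort,
                                 underscore + 1, strlen(underscore + 1)))
        return 0;

    /* The rest is a '-' separated list: language, then script/country. */
    for (p = name;;) {
        const char *dash = memchr(p, '-', (size_t)(end - p));
        size_t len = (size_t)((dash ? dash : end) - p);
        int ok;

        if (first) {
            ok = copy_part(out->language, sizeof out->language, p, len);
            first = 0;
        } else if (len == 4 && !out->script[0] && !out->country[0]) {
            ok = copy_part(out->script, sizeof out->script, p, len);
        } else if (len == 2 && !out->country[0]) {
            ok = copy_part(out->country, sizeof out->country, p, len);
        } else {
            ok = 0;
        }
        if (!ok)
            return 0;
        if (dash == NULL)
            break;
        p = dash + 1;
    }
    return 1;
}
```

For example, "sr-Cyrl-RS" yields language "sr", script "Cyrl" and country 
"RS", while "de-DE_phoneb" yields language "de", country "DE" and sort 
"phoneb".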

1.3. Ansi and Unicode APIs

Many Win32 functions have two versions:

- one that takes LPSTR arguments (CHAR strings) and has an A suffix (e.g. 
CreateDirectoryA)
- one that takes LPWSTR arguments (WCHAR strings) and has a W suffix (e.g. 
CreateDirectoryW)

Most functions which operate on CHAR use code page returned by GetACP[3]. There 
are some exceptions like I/O functions which can use GetOEMCP[4] instead 
depending on return value of AreFileApisANSI[5].

1.4. Code Pages

Each locale has default code pages (ANSI, OEM and MAC) associated with it. I 
will refer only to ANSI code pages.

Default ANSI code page can be obtained by calling GetLocaleInfo[Ex] with 
LOCALE_IDEFAULTANSICODEPAGE. The following code pages may be returned:

- 874 (Thai)
- 932 (Japanese)
- 936 (Chinese Simplified)
- 949 (Korean)
- 950 (Chinese Traditional)
- 1250
- 1251
- 1252
- 1253 (Greek)
- 1254
- 1255 (Hebrew)
- 1256
- 1257 (Estonian, Latvian and Lithuanian)
- 1258 (Vietnamese)

There is a special case when GetLocaleInfo[Ex] returns 0, which indicates that 
the locale has no ANSI code page and can only be used with Unicode.

Most of these are single-byte character sets. The DBCS ones are 932, 936, 949 
and 950. (There are more DBCS code pages).
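
A minimal sketch of how this could look in code. Both function names are 
hypothetical. The Win32 part assumes Windows Vista or newer (for 
GetLocaleInfoEx) and uses LOCALE_RETURN_NUMBER to get the code page as a 
number rather than as a string; it is only compiled on Windows.

```c
#include <assert.h>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600 /* GetLocaleInfoEx needs Windows Vista */
#endif
#include <windows.h>
#endif

/* Returns 1 if `cp` is one of the DBCS default ANSI code pages. */
int is_dbcs_ansi_code_page(unsigned cp)
{
    switch (cp) {
    case 932: /* Japanese */
    case 936: /* Chinese Simplified */
    case 949: /* Korean */
    case 950: /* Chinese Traditional */
        return 1;
    default:
        return 0;
    }
}

#ifdef _WIN32
/* Stores the default ANSI code page of `locale_name` (e.g. L"ja-JP") in
 * *cp; 0 means the locale is Unicode-only. Returns 1 on success. */
int default_ansi_code_page(const wchar_t *locale_name, unsigned *cp)
{
    DWORD value = 0;

    assert(cp != NULL);
    if (GetLocaleInfoEx(locale_name,
                        LOCALE_IDEFAULTANSICODEPAGE | LOCALE_RETURN_NUMBER,
                        (LPWSTR)&value, sizeof value / sizeof(WCHAR)) == 0)
        return 0;
    *cp = (unsigned)value;
    return 1;
}
#endif
```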

Prior to Windows 10 GetACP would only return one of code pages listed above. 
Starting with Windows 10 version 1803 (if I am correct), GetACP can also return 
65001 (UTF-8) if either UTF-8 is globally enabled or application uses UTF-8 
manifest.

2. CRT

2.1. Charset

MSVCRT does not support UTF-8. Based on my tests, it seems to support all SBCS 
and DBCS code pages supported by operating system.

Additionally, UCRT seems to *support* ISO-2022-* charsets (code pages 
50220-50229), ISCII charsets (code pages 57002-57011) and GB18030 (code page 
54936). However, they seem to be broken. Conversion functions fail to convert 
even simple ASCII strings.

For simplicity, I always assume that UCRT supports UTF-8.

2.2. Thread Locales

msvcr80.dll introduced support for thread locales with _configthreadlocale[6] 
function.

Makes implementing POSIX uselocale(3) easier :).

2.3. _locale_t

msvcr80.dll introduced _locale_t type which is used to represent CRT locales. 
It also introduced many functions with _l suffix which accept such an object to 
operate on specific locale.

Some versions of msvcrt.dll after Windows XP contain functions with the _l 
suffix but lack the _create_locale[7] function needed to create such an object, 
rendering them useless. Some later versions of msvcrt.dll do contain 
_create_locale.

An interesting detail is that all CRTs before msvcr80.dll were released before 
Windows Vista. This means they lack thread locales and _locale_t, but at the 
same time only support locales representable with LCID objects. I like to 
treat any msvcrt.dll as pre-msvcr80.dll regardless of Windows version.

2.4. mbctype.h and mbstring.h

msvcrt20.dll introduced mbctype.h and mbstring.h. Those functions are similar 
to those in ctype.h and string.h, but designed to operate on multibyte 
characters (SBCS and DBCS).

Those functions are independent of locale set with setlocale. Instead they use 
code page set with _setmbcp function[8].

I mention this to point out that we should never call them. The code page they 
operate on must be under the full control of the application that uses them. 
Also, they cannot be used in UWP applications[9].

Starting with msvcr80.dll these functions also support thread locales (code 
page set by _setmbcp) and have version with _l suffix.

3. Issues

3.1. LPSTR vs char*

Win32 APIs interpret CHAR using code page returned by GetACP (most of the 
time). CRT interprets char using locale set for LC_CTYPE category (for 
msvcr80.dll and newer, each thread might use a different locale :).

In most cases, calling setlocale(LC_ALL,"") will set locale using the same code 
page as returned by GetACP. There is one notable exception: MSVCRT on systems 
where UTF-8 is enabled globally (like mine).

It is only safe to pass char* from CRT to Win32 APIs when you can be sure that 
they use the same code page. As mentioned above it cannot be guaranteed even 
with setlocale(LC_ALL,"").
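
A rough sketch of how one could check this at run time. The function names 
are hypothetical. The idea is to take the locale string returned by 
setlocale(LC_CTYPE, NULL), extract the code page after the last '.', and 
compare it with GetACP. The comparison itself is only compiled on Windows. 
Note that the "C" locale has no code page in its name, so the sketch reports 
it as a mismatch.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Extract the code page from a CRT locale string such as
 * "Japanese_Japan.932". Returns 65001 for a UTF-8 suffix and 0 when
 * no code page is present (e.g. "C"). */
unsigned code_page_from_locale_string(const char *locale)
{
    const char *dot;
    char *end;
    unsigned long cp;

    assert(locale != NULL);
    dot = strrchr(locale, '.');
    if (dot == NULL || dot[1] == '\0')
        return 0;
    if (strcmp(dot + 1, "utf8") == 0 || strcmp(dot + 1, "UTF8") == 0 ||
        strcmp(dot + 1, "utf-8") == 0 || strcmp(dot + 1, "UTF-8") == 0)
        return 65001;
    cp = strtoul(dot + 1, &end, 10);
    if (*end != '\0')
        return 0;
    return (unsigned)cp;
}

#ifdef _WIN32
/* Returns 1 if the calling thread's LC_CTYPE code page equals GetACP(). */
int lc_ctype_matches_acp(void)
{
    const char *current = setlocale(LC_CTYPE, NULL);

    return current != NULL &&
           code_page_from_locale_string(current) == GetACP();
}
#endif
```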

There are worse things: CRT functions which directly call Win32 APIs. For 
example, _open and fopen.

You would expect them to handle filenames correctly based on LC_CTYPE, but they 
do not. _open and fopen (and possibly many others) seem to pass filenames 
directly to Win32 ANSI functions.

Consider the following example which can happen with MSVCRT: GetACP == CP_UTF8 
and LC_CTYPE is Japanese_Japan.932 (using default code page for Japanese). If 
you pass a valid string (using code page 932) containing kanji to _open or 
fopen it will create file with broken name. If you pass the same string as 
UTF-8, it will create file with correct filename. Try it :).

The simplest solution I see is to wrap such calls: convert char* to wchar_t* 
and call the wide version of the same CRT function. Luckily, Microsoft seems 
obsessed with wchar_t and provides a wide version of nearly (if not) every 
function (this may not be the case for older CRTs. *sigh*).
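
A minimal sketch of such a wrapper for fopen. The function names are 
hypothetical. The conversion uses mbstowcs, so it follows the current 
LC_CTYPE; on Windows the result is passed to _wfopen, and on other platforms 
the sketch simply falls back to fopen so that it still compiles.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <wchar.h>

/* Convert a multibyte string to a newly allocated wide string using the
 * current LC_CTYPE. Returns NULL on invalid input or allocation failure. */
wchar_t *mbs_to_wide_dup(const char *s)
{
    size_t n;
    wchar_t *w;

    assert(s != NULL);
    n = mbstowcs(NULL, s, 0); /* length without the terminator */
    if (n == (size_t)-1)
        return NULL;
    w = malloc((n + 1) * sizeof *w);
    if (w == NULL)
        return NULL;
    if (mbstowcs(w, s, n + 1) == (size_t)-1) {
        free(w);
        return NULL;
    }
    return w;
}

/* fopen replacement that interprets `path` according to LC_CTYPE. */
FILE *fopen_lc_ctype(const char *path, const char *mode)
{
#ifdef _WIN32
    FILE *f = NULL;
    wchar_t *wpath = mbs_to_wide_dup(path);
    wchar_t *wmode = mbs_to_wide_dup(mode);

    if (wpath != NULL && wmode != NULL)
        f = _wfopen(wpath, wmode);
    free(wpath);
    free(wmode);
    return f;
#else
    return fopen(path, mode);
#endif
}
```

A real wrapper would probably also want to preserve errno across the free 
calls; I left that out to keep the sketch short.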

3.2. Locale Categories

Each locale category might use different code page. For example,

    setlocale(LC_ALL, "LC_CTYPE=english;LC_TIME=japanese")

will set LC_CTYPE to English_United States.1252 and LC_TIME to 
Japanese_Japan.932. strftime will produce string which uses code page 932. Now 
use your imagination.

To be fair, this seems to be the case even on Linux/Unix (e.g. glibc). But it 
makes the situation on Windows even worse.

3.3. Lossy Conversion

The CRT's wc*tomb* functions use lossy (best-fit) conversion. This makes them 
unsafe in various situations such as converting filenames from wchar_t* to 
char*.

A simple wrapper around WideCharToMultiByte which losslessly converts the whole 
string at once and allocates buffer for it could be used internally.
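
A rough sketch of what such a wrapper could look like. The function name is 
hypothetical. It assumes WC_NO_BEST_FIT_CHARS together with lpUsedDefaultChar 
is enough to detect any lossy substitution; for CP_UTF8 those are not 
accepted, so WC_ERR_INVALID_CHARS is used there instead. Other special code 
pages (e.g. 50220 or 54936) have similar restrictions and are not handled. 
The non-Windows branch exists only so the sketch compiles elsewhere; it 
ignores `cp` and uses wcstombs.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>
#ifdef _WIN32
#ifndef _WIN32_WINNT
#define _WIN32_WINNT 0x0600 /* WC_ERR_INVALID_CHARS needs Windows Vista */
#endif
#include <windows.h>
#endif

/* Convert `ws` to code page `cp` in a newly allocated buffer. Returns NULL
 * if any character cannot be represented exactly, or on allocation
 * failure. */
char *wide_to_cp_strict(const wchar_t *ws, unsigned cp)
{
#ifdef _WIN32
    DWORD flags = (cp == CP_UTF8) ? WC_ERR_INVALID_CHARS
                                  : WC_NO_BEST_FIT_CHARS;
    BOOL used_default = FALSE;
    BOOL *pused = (cp == CP_UTF8) ? NULL : &used_default;
    char *out;
    int n;

    assert(ws != NULL);
    /* First call: ask for the required size, including the terminator. */
    n = WideCharToMultiByte(cp, flags, ws, -1, NULL, 0, NULL, pused);
    if (n <= 0 || used_default)
        return NULL;
    out = malloc((size_t)n);
    if (out == NULL)
        return NULL;
    used_default = FALSE;
    if (WideCharToMultiByte(cp, flags, ws, -1, out, n, NULL, pused) != n ||
        used_default) {
        free(out);
        return NULL;
    }
    return out;
#else
    char *out;
    size_t n;

    (void)cp; /* fallback: current LC_CTYPE only */
    assert(ws != NULL);
    n = wcstombs(NULL, ws, 0);
    if (n == (size_t)-1)
        return NULL;
    out = malloc(n + 1);
    if (out == NULL)
        return NULL;
    if (wcstombs(out, ws, n + 1) == (size_t)-1) {
        free(out);
        return NULL;
    }
    return out;
#endif
}
```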

I think one reason why those functions use lossy conversion is to allow most 
of the information obtained by GetLocaleInfo[Ex] to be converted to the 
locale's default code page. There are locales where this information cannot be 
represented exactly, and there are many of them. There are also some locales 
where the currency symbol and other information cannot be represented at all, 
even with lossy conversion.

3.4. setlocale

The way CRT's setlocale (and _create_locale) interprets locale string may lead 
to surprises. Particularly, when only part of locale string is valid.

Consider the following locale strings:

- Japanese_Korea
- Korean_Japan

One could expect the first string to set the locale to Japanese_Japan.932 and 
the second to Korean_Korea.949. In practice, the result will be the opposite. 
This behavior is documented (somewhere): the documentation says the CRT picks 
the default language associated with the country.

However, consider following funny locale strings:

- Japanese_United States
- German_United Kingdom

You would expect English_United States.1252 and English_United Kingdom.1252? 
CRT would like to surprise you with Cherokee_United States and Welsh_United 
Kingdom.

No way it can be fixed in mingw-w64.
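
What a caller can do is at least detect the surprise. Here is a minimal 
sketch with a hypothetical helper that calls setlocale, compares the returned 
locale string with the prefix the caller expected, and restores the previous 
locale on mismatch. The 512-byte buffer for the saved locale string is an 
arbitrary assumption.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>

/* Returns 1 if setlocale(category, requested) succeeded and the resulting
 * locale string starts with `expected_prefix`. Otherwise restores the
 * previous locale and returns 0. */
int setlocale_expect(int category, const char *requested,
                     const char *expected_prefix)
{
    char saved[512];
    const char *current, *got;

    assert(requested != NULL && expected_prefix != NULL);

    /* Copy the current locale string; later calls may invalidate it. */
    current = setlocale(category, NULL);
    if (current == NULL || strlen(current) >= sizeof saved)
        return 0;
    strcpy(saved, current);

    got = setlocale(category, requested);
    if (got != NULL &&
        strncmp(got, expected_prefix, strlen(expected_prefix)) == 0)
        return 1;

    setlocale(category, saved);
    return 0;
}
```

With the behavior described above, a call like 
setlocale_expect(LC_ALL, "Japanese_United States", "English_United States") 
would return 0 and leave the locale unchanged.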

3.5. Unicode Locales with MSVCRT

A special case when GetLocaleInfo[Ex] returns 0 for default ANSI code page was 
mentioned earlier. An example of such locale is Armenian_Armenia. If you call 
setlocale(LC_ALL, "armenian"), it will set locale to

LC_COLLATE=Armenian_Armenia.65001;LC_CTYPE=C;LC_MONETARY=Armenian_Armenia.65001;LC_NUMERIC=Armenian_Armenia.65001;LC_TIME=Armenian_Armenia.65001

Note `LC_CTYPE=C`. I consider this broken.

There are no issues with UCRT since UTF-8 is supported: Armenian_Armenia.utf8.

No way it can be fixed in mingw-w64.
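
Detecting the situation is possible, though. Below is a minimal sketch of a 
hypothetical helper that extracts the value of one category from a composite 
locale string like the one above, so that a caller can notice that LC_CTYPE 
ended up as "C" while the other categories did not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Find "category=value" in a ';' separated composite locale string and
 * copy the value into `out`. Returns 1 if found and it fits, else 0. */
int locale_category_value(const char *composite, const char *category,
                          char *out, size_t cap)
{
    const char *p;
    size_t klen;

    assert(composite != NULL && category != NULL);
    assert(out != NULL && cap > 0);

    klen = strlen(category);
    for (p = composite; *p != '\0';) {
        const char *semi = strchr(p, ';');
        size_t len = semi ? (size_t)(semi - p) : strlen(p);

        if (len > klen && strncmp(p, category, klen) == 0 &&
            p[klen] == '=') {
            size_t vlen = len - klen - 1;

            if (vlen >= cap)
                return 0;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            return 1;
        }
        if (semi == NULL)
            break;
        p = semi + 1;
    }
    return 0;
}
```

Applied to the string above, the value for "LC_CTYPE" is "C" and the value 
for "LC_TIME" is "Armenian_Armenia.65001".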

------

I could have missed something, but this should be enough to have an idea how 
locales work on Windows and what kinds of issues we have in CRT.

- Kirill Makurin

[1] https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-makelcid
[2] https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers
[3] https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
[4] https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
[5] https://learn.microsoft.com/en-us/windows/win32/api/FileAPI/nf-fileapi-arefileapisansi
[6] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale
[7] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale
[8] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmbcp
[9] https://learn.microsoft.com/en-us/cpp/cppcx/crt-functions-not-supported-in-universal-windows-platform-apps

