Hello,

I mentioned in my last message in the `Inconsistent behavior of btowc with "C" locale` thread that I would like to start a new thread to discuss the CRT conversion functions mb*towc* and wc*tomb*, and probably other functions which have to deal with locales.

While the CRT's locale functions *work*, the CRT implements many questionable and problematic behaviors. Things only get worse when we consider their interaction with Win32 APIs.

Recently I started working on a project of mine which implements POSIX APIs on top of Win32 and CRT. I started by implementing the *locale* module and I am still working on it.

In this message, I want to summarize what I have learned and discovered about Win32 and CRT locales. This message could be used as a reference in the future.

1. Win32

1.1. LCID

Prior to Windows Vista, Windows locales were represented using LCID objects. They are just DWORDs which are constructed from LANG_* and SUBLANG_* constants using macros like MAKELANGID and MAKELCID[1].

The LCID mechanism has severe limitations and inconsistencies. There are large comments in Microsoft's header files addressing them.

Today, LCID locales should only be of interest if you need to support ancient versions of Windows.

1.2. Locale Names

Windows Vista introduced new APIs which use locale names instead of LCID objects. Windows locale names have the following format:

ll[-ssss][-cc][_xxxxxx]

where:

- ll is an ISO 639 language code
- ssss is a language script (e.g. Latn for Latin or Cyrl for Cyrillic)
- cc is an ISO 3166 country code
- xxxxxx is a sort order identifier[2]

The Windows ll-cc format is similar to the common ll_CC format used by setlocale(3) on Unix/Linux.

1.3. Ansi and Unicode APIs

Many Win32 functions have two versions:

- one that takes an LPSTR argument (CHAR) and has the A suffix (e.g. CreateDirectoryA)
- one that takes an LPWSTR argument (WCHAR) and has the W suffix (e.g. CreateDirectoryW)

Most functions which operate on CHAR use the code page returned by GetACP[3]. There are some exceptions, like I/O functions, which can use GetOEMCP[4] instead depending on the return value of AreFileApisANSI[5].

1.4. Code Pages

Each locale has default code pages (ANSI, OEM and MAC) associated with it. I will refer only to ANSI code pages.

The default ANSI code page can be obtained by calling GetLocaleInfo[Ex] with LOCALE_IDEFAULTANSICODEPAGE. The following code pages may be returned:

- 874 (Thai)
- 932 (Japanese)
- 936 (Chinese Simplified)
- 949 (Korean)
- 950 (Chinese Traditional)
- 1250
- 1251
- 1252
- 1253 (Greek)
- 1254
- 1255 (Hebrew)
- 1256
- 1257 (Estonian, Latvian and Lithuanian)
- 1258 (Vietnamese)

There is a special case when GetLocaleInfo[Ex] returns 0, which indicates that the locale requires Unicode.

Most of the code pages listed above are single-byte character sets (SBCS). The double-byte (DBCS) ones are 932, 936, 949 and 950. (There are more DBCS code pages.)

Prior to Windows 10, GetACP would only return one of the code pages listed above. Starting with Windows 10 version 1803 (if I am correct), GetACP can also return 65001 (UTF-8) if either UTF-8 is globally enabled or the application uses a UTF-8 manifest.

2. CRT

2.1. Charset

MSVCRT does not support UTF-8. Based on my tests, it seems to support all SBCS and DBCS code pages supported by the operating system.

Additionally, UCRT seems to *support* ISO-2022-* charsets (code pages 50220-50229), ISCII charsets (code pages 57002-57011) and GB18030 (code page 54936). However, they seem to be broken: conversion functions fail to convert even simple ASCII strings.

For simplicity, I always assume that UCRT supports UTF-8.

2.2. Thread Locales

msvcr80.dll introduced support for thread locales with the _configthreadlocale[6] function. This makes implementing POSIX uselocale(3) easier :).

2.3. _locale_t

msvcr80.dll introduced the _locale_t type, which is used to represent CRT locales. It also introduced many functions with an _l suffix which accept such an object in order to operate on a specific locale.

Some versions of msvcrt.dll after Windows XP contain functions with the _l suffix but lack the _create_locale[7] function to create such an object, rendering them useless. Some later versions of msvcrt.dll contain the _create_locale function.

An interesting detail to note is that all CRTs before msvcr80.dll were released before Windows Vista.
This means they lack thread locales and _locale_t, but at the same time only support locales representable with LCID objects. I like to treat any msvcrt.dll as pre-msvcr80.dll regardless of Windows version.

2.4. mbctype.h and mbstring.h

msvcrt20.dll introduced mbctype.h and mbstring.h. The functions declared there are similar to those in ctype.h and string.h, but are designed to operate on multibyte characters (SBCS and DBCS).

These functions are independent of the locale set with setlocale. Instead, they use the code page set with the _setmbcp function[8].

I mention this to point out that we should never call them. The code page they operate on must be under the full control of the application that uses them. Also, they cannot be used in UWP applications[9].

Starting with msvcr80.dll, these functions also support thread locales (code page set by _setmbcp) and have versions with the _l suffix.

3. Issues

3.1. LPSTR vs char*

Win32 APIs interpret CHAR using the code page returned by GetACP (most of the time). The CRT interprets char using the locale set for the LC_CTYPE category (for msvcr80.dll and newer, each thread might use a different locale :) ).

In most cases, calling setlocale(LC_ALL,"") will set a locale using the same code page as returned by GetACP. There is one notable exception: MSVCRT on systems where UTF-8 is enabled globally (like mine).

It is only safe to pass char* from the CRT to Win32 APIs when you can be sure that they use the same code page. As mentioned above, this cannot be guaranteed even with setlocale(LC_ALL,"").

There are worse things: CRT functions which directly call Win32 APIs, for example _open and fopen. You would expect them to handle filenames correctly based on LC_CTYPE, but they do not. _open and fopen (and possibly many others) seem to pass filenames directly to Win32 ANSI functions.

Consider the following example, which can happen with MSVCRT: GetACP == CP_UTF8 and LC_CTYPE is Japanese_Japan.932 (using the default code page for Japanese).
If you pass a valid string (using code page 932) containing kanji to _open or fopen, it will create a file with a broken name. If you pass the same string as UTF-8, it will create a file with the correct filename. Try it :).

The simplest solution I see is wrapping such calls by converting char* to wchar_t* and calling the wide version of the same CRT function. Luckily, Microsoft seems obsessed with wchar_t and provides a wide version of nearly (if not) every function (this may not be the case for older CRTs. *sigh*).

3.2. Locale Categories

Each locale category might use a different code page. For example, setlocale(LC_ALL, "LC_CTYPE=english;LC_TIME=japanese") will set LC_CTYPE to English_United States.1252 and LC_TIME to Japanese_Japan.932. strftime will then produce a string which uses code page 932, while everything that depends on LC_CTYPE assumes code page 1252. Now use your imagination.

To be fair, this seems to be the case even on Linux/Unix (e.g. glibc). But it makes the situation on Windows even worse.

3.3. Lossy Conversion

The CRT's wc*tomb* functions use lossy (best-fit) conversion. This makes them unsafe in various situations, such as converting filenames from wchar_t* to char*.

A simple wrapper around WideCharToMultiByte which losslessly converts the whole string at once and allocates a buffer for it could be used internally.

I think one reason why those functions use lossy conversion is to allow converting most of the information obtained by GetLocaleInfo[Ex] to the locale's default code page. There are many locales whose data can only be converted that way. There are also some locales where the currency symbol and other information cannot be represented at all, even with lossy conversion.

3.4. setlocale

The way the CRT's setlocale (and _create_locale) interprets a locale string may lead to surprises, particularly when only part of the locale string is valid.

Consider the following locale strings:

- Japanese_Korea
- Korean_Japan

One could expect the first string to set the locale to Japanese_Japan.932 and the second to Korean_Korea.949. In practice, the result will be the opposite.
This behavior is documented (somewhere). The documentation claims that the locale is set to the default language associated with the country.

However, consider the following funny locale strings:

- Japanese_United States
- German_United Kingdom

You would expect English_United States.1252 and English_United Kingdom.1252? The CRT would like to surprise you with Cherokee_United States and Welsh_United Kingdom.

No way it can be fixed in mingw-w64.

3.5. Unicode Locales with MSVCRT

The special case when GetLocaleInfo[Ex] returns 0 for the default ANSI code page was mentioned earlier. An example of such a locale is Armenian_Armenia.

If you call setlocale(LC_ALL, "armenian"), it will set the locale to

LC_COLLATE=Armenian_Armenia.65001;LC_CTYPE=C;LC_MONETARY=Armenian_Armenia.65001;LC_NUMERIC=Armenian_Armenia.65001;LC_TIME=Armenian_Armenia.65001

Note `LC_CTYPE=C`. I consider this broken. There are no issues with UCRT since UTF-8 is supported: Armenian_Armenia.utf8.

No way it can be fixed in mingw-w64.

------

I could have missed something, but this should be enough to give an idea of how locales work on Windows and what kinds of issues we have in the CRT.

- Kirill Makurin

[1] https://learn.microsoft.com/en-us/windows/win32/api/winnt/nf-winnt-makelcid
[2] https://learn.microsoft.com/en-us/windows/win32/intl/sort-order-identifiers
[3] https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getacp
[4] https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getoemcp
[5] https://learn.microsoft.com/en-us/windows/win32/api/FileAPI/nf-fileapi-arefileapisansi
[6] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/configthreadlocale
[7] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/create-locale-wcreate-locale
[8] https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmbcp
[9] https://learn.microsoft.com/en-us/cpp/cppcx/crt-functions-not-supported-in-universal-windows-platform-apps

_______________________________________________
Mingw-w64-public mailing list
Mingw-w64-public@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/mingw-w64-public