> cmd.exe can use cp65001 aka utf8???

CMD is a Unicode application that for the most part uses WinAPI wide-character functions, including the console API functions (as does Python 3.6+). There are a few exceptions. CMD uses the console codepage when decoding batch files (line by line, so you can change the codepage in the middle of a batch script), when writing output from its internal commands (e.g. dir) to pipes and files (the /u option overrides this), and when reading output from programs in a `FOR /F` loop.

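To make the line-by-line decoding concrete, here's a rough pure-Python model of it. This is only a sketch: `decode_batch` is a name I made up, and spotting a `chcp` line with a regex is my simplification. Real CMD doesn't parse `chcp`; it just uses whatever the console codepage happens to be when it reads each line.

```python
import re


def decode_batch(raw: bytes, codepage: str = "cp850"):
    """Decode a batch script one line at a time (simulation only).

    Each line is decoded with the codepage in effect when that line is
    read. A 'chcp N' line changes the codepage for the lines after it.
    """
    lines = []
    for raw_line in raw.splitlines():
        line = raw_line.decode(codepage, errors="replace")
        lines.append(line)
        match = re.fullmatch(r"\s*chcp\s+(\d+)\s*", line, flags=re.IGNORECASE)
        if match:
            number = match.group(1)
            # Python has no portable 'cp65001' codec name, so map it to utf-8.
            codepage = "utf-8" if number == "65001" else "cp" + number
    return lines


# 'é' is byte 0x82 in cp850 but the two bytes 0xC3 0xA9 in UTF-8.
script = b"echo \x82\r\nchcp 65001\r\necho \xc3\xa9\r\n"
decoded = decode_batch(script)
```

Both `echo` lines come out as `echo é` here, even though the bytes on disk differ, because the codec changes after the `chcp 65001` line.
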
> Why does cmd.exe still use cp850?

In the above cases CMD uses the active console input or output codepage, which defaults to the system locale's OEM codepage. If it's not attached to a console (i.e. when run as a DETACHED_PROCESS), CMD uses the ANSI codepage in these cases.

Anyway, you appear to be talking about the Windows console, which people often confuse with CMD. Programs that use command-line interfaces (CLIs) and text user interfaces (TUIs), such as classic system shells, are clients of a given console or terminal interface. A TUI application typically is tightly integrated with the console or terminal interface (e.g. a curses application), while a CLI application typically just uses standard I/O (stdin, stdout, stderr). Both cmd.exe and python.exe are Windows console clients. There's nothing special about cmd.exe in this regard.

Now, there are a couple of significant problems with using codepage 65001 in the Windows console.

Prior to Windows 8, WriteFile and WriteConsoleA return the number of decoded wide characters written to the console, which is a bug because they're supposed to return the number of bytes written. It's not a problem so long as there's a one-to-one mapping between bytes and characters in the console's output codepage. But UTF-8 can have up to 4 bytes per character. This misleads buffered writers such as C FILE streams and Python 3's io module: they see a short write and write the supposedly unwritten tail of the buffer again, which in turn causes gibberish to be printed after every write of a string that includes non-ASCII characters.

Prior to Windows 10, with codepage 65001, reading input from the console via ReadFile or ReadConsoleA fails if the input has non-ASCII characters. It gets reported as a successful read of zero bytes. This causes Python to think it's at EOF, so the REPL quits (as if Ctrl+Z had been entered) and input() raises EOFError.

Even in Windows 10, while the entire read doesn't fail, it's not much better. It replaces non-ASCII characters with NUL bytes.
For example, in Windows 10.0.15063:

    >>> import os
    >>> os.read(0, 100)
    abcαβγdef
    b'abc\x00\x00\x00def\r\n'

Microsoft is gradually working on fixing UTF-8 support in the console (well, two developers are working on it). They appear to have fixed it at least for the private console APIs used by the new Linux subsystem in Windows 10:

    Python 3.5.2 (default, Nov 17 2016, 17:05:23)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import os
    >>> s = os.read(0, 100)
    abcαβγdef
    >>> s
    b'abc\xce\xb1\xce\xb2\xce\xb3def\n'
    >>> s.decode()
    'abcαβγdef\n'

Maybe it's fixed in the Windows API in an upcoming update. But still, there are a lot of Windows 7 and 8 systems out there, for which codepage 65001 in the console will remain broken.

> I always thought 65001 was not a 'real' codepage, even though some locales
> (e.g. Georgia) use it [1].

Codepage 65001 isn't used by any system locale as the legacy ANSI or OEM codepage. The console allows it probably because no one thought to prevent using it in the late 1990s. It has been buggy for two decades.

Moodle seems to have special support for using UTF-8 with Georgian. But as far as Windows is concerned, there is no legacy codepage for Georgian. For example:

    import ctypes
    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    LD_ACP = LOCALE_IDEFAULTANSICODEPAGE = 0x00001004
    acp = (ctypes.c_wchar * 6)()

    >>> kernel32.GetLocaleInfoEx('ka-GE', LD_ACP, acp, 6)
    2
    >>> acp.value
    '0'

A value of zero here means no ANSI codepage is defined [1]:

    If no ANSI code page is available, only Unicode can be used for
    the locale. In this case, the value is CP_ACP (0). Such a locale
    cannot be set as the system locale. Applications that do not
    support Unicode do not work correctly with locales marked as
    "Unicode only".

Georgian (ka-GE) is a Unicode-only locale [2] that cannot be set as the system locale.

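Going back to the pre-Windows 8 WriteFile bug for a moment: here's a rough simulation of why a buffered writer prints trailing gibberish. It doesn't touch the real console at all. `BuggyConsole` and `render` are hypothetical names of mine, and the fake raw stream simply imitates the bug by returning the number of decoded characters instead of the number of bytes. The buffered layer is Python's real `io.BufferedWriter`, which treats the short count as a partial write and writes the "remaining" bytes again.

```python
import io


class BuggyConsole(io.RawIOBase):
    """Fake raw console stream that imitates the old return-value bug."""

    def __init__(self):
        super().__init__()
        self.screen = []  # everything that was "displayed", in order

    def writable(self):
        return True

    def write(self, b):
        data = bytes(b)  # the buffered layer may pass a memoryview
        text = data.decode("utf-8", errors="replace")
        self.screen.append(text)
        # Simulated bug: report characters written, not bytes written.
        # (The real console counted UTF-16 code units; for the BMP text
        # used here that equals len(text).)
        return len(text)


def render(text):
    """Write text through a BufferedWriter and return what was displayed."""
    console = BuggyConsole()
    writer = io.BufferedWriter(console)
    writer.write(text.encode("utf-8"))
    writer.flush()
    return "".join(console.screen)


ascii_only = render("abc\n")
with_greek = render("abcαβγdef\n")
```

With pure ASCII the byte and character counts agree, so nothing is repeated. With `abcαβγdef\n` the stream reports 10 of 13 bytes written, so the writer sends the last 3 bytes (`ef\n`) a second time.
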
[1]: https://msdn.microsoft.com/en-us/library/dd373761
[2]: https://msdn.microsoft.com/en-us/library/ms930130.aspx