On Sun, Mar 13, 2016 at 3:14 AM, Albert-Jan Roskam <sjeik_ap...@hotmail.com> wrote:
> I thought that utf-8 (cp65001) is by definition (or by design?) impossible
> for console output in windows? Aren't there "w" (wide) versions of functions
> that do accept utf-8?
The wide-character API works with the native Windows character encoding, UTF-16. Except the console is a bit 'special'. A surrogate pair (e.g. a non-BMP emoji) appears as 2 box characters, but if you copy it from the console into a rich-text application, it renders normally. The console also doesn't support variable-width fonts for mixing narrow and wide (East Asian) glyphs on the same screen.

If that matters, there's a program called ConEmu that hides the console and proxies its screen and input buffers to drive an improved interface with flexible font support, ANSI/VT100 terminal emulation, and tabs. Paired with win_unicode_console, it's almost as good as a Linux terminal, but the number of hoops you have to jump through to make it all work is excessive.

Some people try to use UTF-8 (codepage 65001) with the ANSI API -- ReadConsoleA/ReadFile and WriteConsoleA/WriteFile. But the console's UTF-8 support is dysfunctional; it was never designed to handle it.

In Windows 7, WriteFile calls WriteConsoleA, which decodes the buffer to UTF-16 using the current codepage and returns the number of UTF-16 'characters' written instead of the number of bytes. This confuses buffered writers. Say a program writes a 20-byte UTF-8 string with 2 bytes per character. WriteFile reports that it successfully wrote 10 'characters', so the buffered writer thinks only 10 of the 20 bytes went out and writes the last 10 bytes again. This leaves a trail of garbage text after every write.

When a program reads from the console using ReadFile or ReadConsoleA, the console's input buffer has to be encoded to the target codepage. The console assumes that an ANSI character is 1 byte, so if you try to read N bytes, it tries to encode N characters. That fails for non-ASCII UTF-8, which uses 2 to 4 bytes per character, and the console won't decrease the number of characters to fit the N-byte buffer. In the API the argument is named "nNumberOfCharsToRead", and they're sticking to that literally.
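You can see the write-side failure with a small Python simulation. To be clear, this is a hypothetical sketch, not the real console API: broken_write here stands in for Windows 7's WriteFile into a cp65001 console, returning a character count where the caller expects a byte count (and the errors='replace' decoding of a torn-up trailing sequence is a simplification of what the real codepage conversion does).

```python
# Hypothetical simulation of the Windows 7 WriteFile/WriteConsoleA bug:
# the write succeeds in full, but reports how many UTF-16 'characters'
# were decoded, and the buffered writer reads that number as bytes.

output = bytearray()

def broken_write(data: bytes) -> int:
    """Stand-in for WriteFile on a cp65001 console: writes everything,
    but returns a character count instead of a byte count."""
    output.extend(data)
    # errors='replace' approximates how a torn multi-byte sequence
    # still converts to *some* characters rather than failing.
    return len(data.decode('utf-8', errors='replace'))

def buffered_write(data: bytes) -> None:
    """A typical buffered writer: keep writing until the reported
    byte count covers the whole buffer."""
    pos = 0
    while pos < len(data):
        pos += broken_write(data[pos:])

# 10 two-byte characters -> a 20-byte UTF-8 string
text = 'é' * 10
buffered_write(text.encode('utf-8'))

# The first call wrote all 20 bytes but reported 10, so the writer
# resent the tail -- and kept resending ever-shorter garbage tails.
print(len(output))  # more than the 20 bytes we asked to write
print(output.decode('utf-8', errors='replace'))
```

The first 20 bytes of `output` are the real text; everything after them is the "trail of garbage" described above.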
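The read-side failure can be simulated the same way. Again, broken_read is a hypothetical stand-in for ReadFile/ReadConsoleA on a cp65001 console, which treats the requested byte count as a character count and gives up when the encoded result doesn't fit:

```python
# Hypothetical simulation of ReadFile/ReadConsoleA with codepage 65001:
# asked for N bytes, the console encodes N *characters*; if the result
# doesn't fit in the N-byte buffer, it reports 0 bytes read.

def broken_read(console_input: str, nbytes: int) -> bytes:
    """Stand-in for ReadFile: treats nbytes as nNumberOfCharsToRead."""
    chars = console_input[:nbytes]      # N characters, not N bytes
    encoded = chars.encode('utf-8')
    if len(encoded) > nbytes:           # doesn't fit -- no shrinking
        return b''                      # 0 bytes read: looks like EOF
    return encoded

print(broken_read('hello', 5))   # ASCII is 1 byte/char, so this works
print(broken_read('héllo', 5))   # 'é' is 2 bytes: 6 > 5, returns b''
```

A caller like Python's input() sees that empty read as end-of-file, which is exactly the symptom described next.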
The result is that 0 bytes are read, which is interpreted as EOF. So the REPL will quit, and input() will raise EOFError.

_______________________________________________
Tutor maillist  -  Tutor@python.org
To unsubscribe or change subscription options:
https://mail.python.org/mailman/listinfo/tutor