Eryk Sun <[email protected]> added the comment:
To be compatible with Windows 7, _io__WindowsConsoleIO_write_impl in
Modules/_io/winconsoleio.c is forced to write to the console in chunks that do
not exceed 32 KiB. It does so by repeatedly dividing the length to decode by 2
until the decoded buffer size is small enough.
    wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    while (wlen > 32766 / sizeof(wchar_t)) {
        len /= 2;
        wlen = MultiByteToWideChar(CP_UTF8, 0, b->buf, len, NULL, 0);
    }
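For illustration, the halving loop can be modeled in Python. This is a rough
sketch, not the C implementation: first_chunk_len and wide_len are
hypothetical names, and decoding with errors="replace" stands in for
MultiByteToWideChar, which is assumed to yield the same number of code units
for the input discussed here.

```python
def first_chunk_len(buf: bytes, limit_bytes: int = 32766,
                    wchar_size: int = 2) -> int:
    """Return how many UTF-8 bytes the first console write would take."""

    def wide_len(data: bytes) -> int:
        # Count UTF-16 code units after decoding; invalid or truncated
        # sequences are replaced (assumed to match the Windows API here).
        text = data.decode("utf-8", errors="replace")
        return len(text.encode("utf-16-le")) // 2

    length = len(buf)
    wlen = wide_len(buf[:length])
    # Same shape as the C loop: halve the byte length until it fits.
    while wlen > limit_bytes // wchar_size:
        length //= 2
        wlen = wide_len(buf[:length])
    return length
```

Note that the loop only looks at the byte count, so nothing stops the cut
from landing inside a multi-byte sequence.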
With `('é' * 40 + '\n') * 473`, encoded as UTF-8, we have 473 82-byte lines
(note that "\n" has been translated to "\r\n"). This is 38,786 bytes, which
decodes to 19,866 UTF-16 code units. That exceeds the limit of 16,383 code
units (32766 / sizeof(wchar_t)) for a single write, so the length is halved.
>>> 38786 // 2
19393
>>> 19393 // 82
236
>>> 19393 % 82
41
This means the first write ends partway through line 237, with 20 'é'
characters (UTF-8 b'\xc3\xa9') and one partial character sequence, b'\xc3'.
When this buffer is passed to MultiByteToWideChar to decode from UTF-8 to
UTF-16, the partial sequence gets decoded as the replacement character
U+FFFD. For the next write, the remaining b'\xa9' byte also gets decoded as
U+FFFD.
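The effect of the split can be reproduced in Python. This is only an
approximation of what happens in the console code: it assumes that decoding
with errors="replace" treats the two stray bytes the same way
MultiByteToWideChar does, and the variable names are made up for the example.

```python
# 473 lines of 40 'é' plus "\r\n", as UTF-8 (82 bytes per line).
data = ("é" * 40 + "\r\n").encode("utf-8") * 473

# Split at the halved byte length, as the write loop does.
half = len(data) // 2
first, rest = data[:half], data[half:]

# The first chunk ends with a lone lead byte, the second starts with
# a lone continuation byte; each becomes U+FFFD when decoded.
first_text = first.decode("utf-8", errors="replace")
rest_text = rest.decode("utf-8", errors="replace")

print(first[-1:], rest[:1])
print(ascii(first_text[-3:]), ascii(rest_text[:3]))
```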
To avoid this, _io__WindowsConsoleIO_write_impl could decode the whole buffer
in one pass, and slice that up into writes that are less than 32 KiB. Or it
could ensure that its UTF-8 slices are always at character boundaries.
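The second option could look roughly like the following, written in Python
for brevity. It is a sketch of the idea only, not a proposed patch:
utf8_safe_prefix_len is a hypothetical helper, and it assumes the buffer
holds valid UTF-8.

```python
def utf8_safe_prefix_len(buf: bytes, length: int) -> int:
    """Shrink length so buf[:length] does not end inside a UTF-8 sequence."""
    if length >= len(buf):
        return len(buf)
    cut = length
    # Continuation bytes look like 0b10xxxxxx. A sequence has at most
    # three of them, so valid input needs at most three steps back.
    for _ in range(3):
        if cut == 0 or (buf[cut] & 0xC0) != 0x80:
            break
        cut -= 1
    return cut
```

Applied after each halving step, this would move the cut in the example above
from 19393 back to 19392, so the first write ends on a complete 'é'.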
----------
components: +IO
nosy: +eryksun
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue37871>
_______________________________________