Marko Rauhamaa wrote:
> Chris Angelico <[email protected]>:
>
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character.
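But it is a valid code point, even if it is a surrogate rather than a
character. (Strictly speaking it is not one of Unicode's "noncharacters"
either; that term is reserved for a different set of code points.)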
>> It is quite correct to throw this error.
>
> '\udd00' is a valid str object:
Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.
> >>> '\udd00'
> '\udd00'
> >>> '\udd00'.encode('utf-32')
> b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
> >>> '\udd00'.encode('utf-16')
> b'\xff\xfe\x00\xdd'
If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.
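
A minimal sketch of the difference, using an ordinary character rather
than a lone surrogate so that it should run the same on any Python 3
version. The variable names are just for illustration:

```python
import codecs

text = "a"  # plain ASCII letter, deliberately not a surrogate

# Generic codec name: output starts with a BOM, then native byte order.
with_bom = text.encode("utf-16")

# Explicit byte order: no BOM is written.
little = text.encode("utf-16-le")
big = text.encode("utf-16-be")

print(with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))
print(little)
print(big)
```

The same applies to utf-32 versus utf-32-le and utf-32-be.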
> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.
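
I'm assuming the scheme described here is the one behind Python's
"surrogateescape" error handler (PEP 383), which maps each undecodable
byte 0x80..0xFF to U+DC80..U+DCFF. A rough sketch of the round trip and
of one side effect; the sample bytes are made up:

```python
raw = b"abc\xff\x80"  # made-up sample: not valid UTF-8

# Undecodable bytes are smuggled through as lone surrogates.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes.
round_trip = text.encode("utf-8", errors="surrogateescape")

# Side effect: the resulting str cannot be encoded strictly.
try:
    text.encode("utf-8")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

print(ascii(text), round_trip == raw, strict_ok)
```

So the bytes survive, but only as long as every consumer of the string
uses the same error handler.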
--
Steven
--
https://mail.python.org/mailman/listinfo/python-list