Re: Newbie question about text encoding

Steven D'Aprano Sun, 08 Mar 2015 11:32:50 -0700

Marko Rauhamaa wrote:

> Chris Angelico <[email protected]>:
> 
>> Once again, you appear to be surprised that invalid data is failing.
>> Why is this so strange? U+DD00 is not a valid character.


But it is a valid non-character code point.


>> It is quite correct to throw this error.
> 
> '\udd00' is a valid str object:

Is it though? Perhaps the bug is not UTF-8's inability to encode lone
surrogates, but that Python allows you to create lone surrogates in the
first place. That's not a rhetorical question. It's a genuine question.


>    >>> '\udd00'
>    '\udd00'
>    >>> '\udd00'.encode('utf-32')
>    b'\xff\xfe\x00\x00\x00\xdd\x00\x00'
>    >>> '\udd00'.encode('utf-16')
>    b'\xff\xfe\x00\xdd'

If you explicitly specify the endianness (say, utf-16-be or -le) then you
don't get the BOMs.

> I was simply stating that UTF-8 is not a bijection between unicode
> strings and octet strings (even forgetting Python). Enriching Unicode
> with 128 surrogates (U+DC80..U+DCFF) establishes a bijection, but not
> without side effects.



-- 
Steven

-- 
https://mail.python.org/mailman/listinfo/python-list

Re: Newbie question about text encoding

Reply via email to