Alexander Belopolsky <[email protected]> added the comment:
> It appears this is an invalid unicode character.
> Shouldn't this be caught by decode("utf8")
It should, and in Python 3.x it is:
>>> b'\xed\xa8\x80'.decode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-1: invalid continuation byte
The Python 2.7 behavior seems to be a bug:
>>> '\xed\xa8\x80'.decode("utf8")
u'\uda00'
Note also the following difference:
In 3.x:
>>> b'\xed\xa8\x80'.decode("utf8", 'replace')
'��'
In 2.7:
>>> '\xed\xa8\x80'.decode("utf8", 'replace')
u'\uda00'
I am not sure this should be fixed in 2.x. Lone surrogates seem to round-trip
just fine in 2.x, and there is likely to be existing code that relies on this.
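
For comparison, here is a minimal sketch of how the same bytes can be handled in 3.x, assuming Python 3.1 or later, where the "surrogatepass" error handler is available. The variable names are only illustrative. Strict decoding rejects the bytes, while "surrogatepass" gives the 2.x-style result and round-trips on encode:

```python
# The three bytes from the report: a UTF-8-style encoding of the lone surrogate U+DA00.
raw = b"\xed\xa8\x80"

# Strict decoding (the default) refuses the sequence.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# Opting in to "surrogatepass" lets the lone surrogate through, as 2.x does by default.
lenient = raw.decode("utf-8", "surrogatepass")

# The same handler is needed to encode the surrogate back to bytes.
round_trip = lenient.encode("utf-8", "surrogatepass")
```

With this, code that really needs lone surrogates has to ask for them explicitly rather than getting them silently.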
> Shouldn't anything generated by json.dumps be parsed by json.loads?
This on the other hand should probably be fixed by either rejecting lone
surrogates in json.dumps or accepting them in json.loads or both. The last
alternative would be consistent with the common wisdom of being conservative in
what you produce but liberal in what you accept.
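
As a rough sketch of the two options, assuming a recent Python 3 in which json.loads accepts lone surrogates: contains_surrogates is a hypothetical helper a caller could use to reject such strings before calling json.dumps; it is not part of the json module.

```python
import json


def contains_surrogates(text):
    """Hypothetical helper: True if text cannot be encoded as strict UTF-8."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


lone = "\uda00"

# "Conservative in what you produce": a caller can check before serializing.
would_reject = contains_surrogates(lone)

# "Liberal in what you accept": with the default ensure_ascii=True the
# surrogate is written as an ASCII \u escape, and loads reads it back.
encoded = json.dumps(lone)
decoded = json.loads(encoded)
round_trips = decoded == lone
```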
----------
nosy: +belopolsky, haypo
versions: +Python 2.7
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue11489>
_______________________________________