[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset
New submission from Marko Lalic: When the message's Content-Transfer-Encoding is set to 8bit, the get_payload(decode=True) method returns the payload encoded using raw-unicode-escape. This means that it is impossible to decode the returned bytes using the content charset obtained by the get_content_charset method. It seems this should be fixed so that get_payload returns the bytes as found in the payload when Content-Transfer-Encoding is 8bit, exactly like Python2.7 handles it. >>> from email import message_from_string >>> message = message_from_string("""MIME-Version: 1.0 ... Content-Type: text/plain; charset=utf-8 ... Content-Disposition: inline ... Content-Transfer-Encoding: 8bit ... ... ünicöde data..""") >>> message.get_content_charset() 'utf-8' >>> message.get_payload(decode=True) b'\xfcnic\xf6de data..' >>> message.get_payload(decode=True).decode(message.get_content_charset()) Traceback (most recent call last): File "", line 1, in UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte >>> message.get_payload(decode=True).decode('raw-unicode-escape') 'ünicöde data..' -- components: email messages: 191526 nosy: barry, mlalic, r.david.murray priority: normal severity: normal status: open title: get_payload method returns bytes which cannot be decoded using the message's charset type: behavior versions: Python 3.3 ___ Python tracker <http://bugs.python.org/issue18271> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset
Marko Lalic added the comment: That will work fine as long as the characters are actually latin. We cannot forget the rest of the unicode character planes. Consider:: >>> message = message_from_string("""MIME-Version: 1.0 ... Content-Type: text/plain; charset=utf-8 ... Content-Disposition: inline ... Content-Transfer-Encoding: 8bit ... ... 한글ᥡ╥ສए""") >>> message.get_payload(decode=True).decode('latin1') '\\ud55c\\uae00\\u1961\\u2565\\u0eaa\\u090f' >>> message.get_payload(decode=True).decode('raw-unicode-escape') '한글ᥡ╥ສए' However, even if latin1 did work, the main point is that a different encoding than the one the message specifies must be used in order to decode the bytes to a unicode string. -- ___ Python tracker <http://bugs.python.org/issue18271> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset
Marko Lalic added the comment: Thank you for your reply. Unfortunately, I have a use case where message_from_bytes has a pretty great disadvantage. I have to parse the received message and then forward it completely unchanged, apart from possibly adding a few new headers. The problem with message_from_bytes is that it changes the Content-Transfer-Encoding header to base64 (and consequently base64 encodes the content). Do you possibly have a suggestion how to currently go about solving this problem? A possible solution I can spot from your answer is to check the Content-Transfer-Encoding before getting the payload and use the version without decode=True when it is 8bit. Maybe there is something more elegant? Thank you in advance. -- ___ Python tracker <http://bugs.python.org/issue18271> ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com