[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

New submission from Marko Lalic:

When the message's Content-Transfer-Encoding is set to 8bit, the 
get_payload(decode=True) method returns the payload encoded using 
raw-unicode-escape. This means that it is impossible to decode the returned 
bytes using the content charset obtained by the get_content_charset method.

It seems this should be fixed so that get_payload returns the bytes exactly as 
found in the payload when Content-Transfer-Encoding is 8bit, as Python 2.7 
does.

>>> from email import message_from_string
>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... ünicöde data..""")
>>> message.get_content_charset()
'utf-8'
>>> message.get_payload(decode=True)
b'\xfcnic\xf6de data..'
>>> message.get_payload(decode=True).decode(message.get_content_charset())
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 0: invalid start byte
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'ünicöde data..'
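
For comparison, here is a minimal sketch of the same message parsed from bytes instead of from a string. It assumes the raw message is available as UTF-8 encoded bytes (the `raw` variable is made up for illustration). In that case the payload bytes come back as they were received and can be decoded with the declared charset.

```python
from email import message_from_bytes

# Assumption: the raw message arrives as bytes with a UTF-8 encoded body.
raw = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Disposition: inline\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "ünicöde data..".encode("utf-8")

message = message_from_bytes(raw)

# With 8bit transfer encoding the original body bytes are returned unchanged.
payload = message.get_payload(decode=True)

# The declared charset now matches the actual byte encoding.
text = payload.decode(message.get_content_charset())
print(text)
```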

--
components: email
messages: 191526
nosy: barry, mlalic, r.david.murray
priority: normal
severity: normal
status: open
title: get_payload method returns bytes which cannot be decoded using the 
message's charset
type: behavior
versions: Python 3.3

___
Python tracker 
<http://bugs.python.org/issue18271>
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

Marko Lalic added the comment:

That will work only as long as every character falls in the Latin-1 range. We 
cannot forget the rest of the Unicode character planes. Consider::

>>> message = message_from_string("""MIME-Version: 1.0
... Content-Type: text/plain; charset=utf-8
... Content-Disposition: inline
... Content-Transfer-Encoding: 8bit
... 
... 한글ᥡ╥ສए""")
>>> message.get_payload(decode=True).decode('latin1')
'\\ud55c\\uae00\\u1961\\u2565\\u0eaa\\u090f'
>>> message.get_payload(decode=True).decode('raw-unicode-escape')
'한글ᥡ╥ສए'

However, even if latin1 did work, the main point is that a different encoding 
than the one the message specifies must be used in order to decode the bytes to 
a unicode string.
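
The difference between the two examples can be reproduced with the codecs alone, without the email package. This is only a sketch of the codec behaviour: raw-unicode-escape writes code points up to U+00FF as single bytes (which happen to match Latin-1) and everything above as literal `\uXXXX` escape text.

```python
# Characters in the Latin-1 range become single bytes, identical to Latin-1.
latin_bytes = "ünicöde".encode("raw-unicode-escape")
same_as_latin1 = latin_bytes == "ünicöde".encode("latin1")

# Characters above U+00FF become ASCII escape sequences instead.
hangul_bytes = "한글".encode("raw-unicode-escape")

# Decoding those bytes as Latin-1 therefore yields the escape text itself,
# while raw-unicode-escape restores the original characters.
as_latin1 = hangul_bytes.decode("latin1")
round_trip = hangul_bytes.decode("raw-unicode-escape")

print(latin_bytes, hangul_bytes, as_latin1, round_trip)
```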




[issue18271] get_payload method returns bytes which cannot be decoded using the message's charset

2013-06-20 Thread Marko Lalic

Marko Lalic added the comment:

Thank you for your reply.

Unfortunately, I have a use case where message_from_bytes has a significant 
drawback. I have to parse the received message and then forward it completely 
unchanged, apart from possibly adding a few new headers. The problem with 
message_from_bytes is that the Content-Transfer-Encoding header gets changed to 
base64 (and consequently the content is base64 encoded).
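
One possible way to forward such a message, sketched under assumptions: the message was parsed with message_from_bytes under the default policy and is serialized back to bytes rather than to a string. `as_bytes()` is only available from Python 3.4 on; on 3.3, `email.generator.BytesGenerator` plays the same role. The `X-Forwarded-By` header and its value are made up for illustration.

```python
from email import message_from_bytes

# Assumption: the received message is available as bytes with a UTF-8 body.
raw = (
    b"MIME-Version: 1.0\n"
    b"Content-Type: text/plain; charset=utf-8\n"
    b"Content-Transfer-Encoding: 8bit\n"
    b"\n"
) + "ünicöde data..".encode("utf-8")

message = message_from_bytes(raw)

# Hypothetical extra header added before forwarding.
message["X-Forwarded-By"] = "example-relay"

# Serializing to bytes writes the 8bit body back out as it was received.
forwarded = message.as_bytes()
print(forwarded.decode("utf-8"))
```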

Do you have a suggestion for how to work around this problem for now? A 
possible solution I can see from your answer is to check the 
Content-Transfer-Encoding before getting the payload and to call get_payload 
without decode=True when it is 8bit. Maybe there is something more elegant?
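
The check described above could look roughly like the following sketch. The helper name `payload_as_text` is hypothetical, and it only handles non-multipart messages.

```python
from email import message_from_string


def payload_as_text(message):
    """Hypothetical helper: return the body of a non-multipart message as str."""
    cte = str(message.get("Content-Transfer-Encoding", "")).strip().lower()
    payload = message.get_payload()
    if cte == "8bit" and isinstance(payload, str):
        # For 8bit content the undecoded payload is already text.
        return payload
    # Otherwise undo the transfer encoding and decode with the declared charset.
    charset = message.get_content_charset() or "ascii"
    return message.get_payload(decode=True).decode(charset)


eight_bit = message_from_string(
    "MIME-Version: 1.0\n"
    "Content-Type: text/plain; charset=utf-8\n"
    "Content-Transfer-Encoding: 8bit\n"
    "\n"
    "ünicöde data.."
)
plain = message_from_string("Content-Type: text/plain\n\nhello")

eight_bit_text = payload_as_text(eight_bit)
plain_text = payload_as_text(plain)
print(eight_bit_text, plain_text)
```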

Thank you in advance.
