New submission from Merlijn van Deen:
Bugzilla sends e-mail in a format where =?UTF-8 is not preceded by whitespace.
As a result, email.headerregistry.UnstructuredHeader (and
email._header_value_parser behind it) does not recognise the encoded words.
>>> import email.headerregistry, pprint
>>> x = {}
>>> email.headerregistry.UnstructuredHeader.parse(
...     '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
...     'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x)
>>> pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
'russian text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94',
'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'),
WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'),
ValueTerminal('New:=?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;'),
WhiteSpaceTerminal('\t'), ValueTerminal('russian'), WhiteSpaceTerminal(' '),
ValueTerminal('text:=?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94')])}
versus
>>> x = {}
>>> email.headerregistry.UnstructuredHeader.parse(
...     '[Bug 64155]\tNew: =?UTF-8?Q?=20non=2Dascii=20bug=20t=C3=A9st?=;\t'
...     'russian text: =?UTF-8?Q?=20=D0=90=D0=91=D0=92=D0=93=D2=90=D0=94', x)
>>> pprint.pprint(x)
{'decoded': '[Bug 64155]\tNew: non-ascii bug tést;\trussian text: АБВГҐД',
'parse_tree': UnstructuredTokenList([ValueTerminal('[Bug'),
WhiteSpaceTerminal(' '), ValueTerminal('64155]'), WhiteSpaceTerminal('\t'),
ValueTerminal('New:'), WhiteSpaceTerminal(' '),
EncodedWord([WhiteSpaceTerminal(' '), ValueTerminal('non-ascii'),
WhiteSpaceTerminal(' '), ValueTerminal('bug'), WhiteSpaceTerminal(' '),
ValueTerminal('tést')]), ValueTerminal(';'), WhiteSpaceTerminal('\t'),
ValueTerminal('russian'), WhiteSpaceTerminal(' '), ValueTerminal('text:'),
WhiteSpaceTerminal(' '), EncodedWord([WhiteSpaceTerminal(' '),
ValueTerminal('АБВГҐД')])])}
I have attached the raw e-mail.
Judging by the code, this is supposed to work (while raising a Defect --
"missing whitespace before encoded word"), but the code splits by whitespace:
    tok, *remainder = _wsp_splitter(value, 1)
which swallows the encoded section together with the text in front of it, so
the encoded word is never seen on its own. In a second attachment, I added a
patch which 1) adds a test case for this and 2) implements a solution, but the
solution is unfortunately not in the style of the rest of the module.
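To illustrate the splitting problem in isolation, here is a minimal sketch. Note that wsp_splitter below is a local stand-in that I assume behaves like the module's private _wsp_splitter (a regex split on runs of spaces and tabs); it is not imported from the module, and the header value is shortened for readability.

```python
import re

# Assumption: approximates the private _wsp_splitter in
# email._header_value_parser (split on runs of space/tab, keeping them).
wsp_splitter = re.compile(r'([ \t]+)').split

value = 'New:=?UTF-8?Q?=20t=C3=A9st?=;\trussian'

# Same unpacking pattern as the line quoted above.
tok, *remainder = wsp_splitter(value, 1)

# The first token contains the plain text *and* the encoded word, because
# there is no whitespace between 'New:' and '=?UTF-8?Q?...?='.
print(repr(tok))
print(remainder)
```

With a space after 'New:' the first token would be just 'New:', and the encoded word would start the next token, which is the case the parser handles.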
In the meantime, I've chosen a monkey-patching approach to work around the
issue:
    import email._header_value_parser, email.headerregistry

    def get_unstructured(value):
        # Turn the encoded leading space (=20) into a real space in front of
        # the encoded word, so the parser sees the whitespace it expects.
        value = value.replace("=?UTF-8?Q?=20", " =?UTF-8?Q?")
        return email._header_value_parser.get_unstructured(value)

    email.headerregistry.UnstructuredHeader.value_parser = staticmethod(
        get_unstructured)
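The replace() call above only covers UTF-8 Q-encoded words that start with =20. A somewhat more general pre-processing step might look like the sketch below. The names ENCODED_WORD and separate_encoded_words are hypothetical, and the pattern is a simplified approximation of the RFC 2047 encoded-word syntax rather than the module's own. Unlike the replace() above, it inserts a real space, so the decoded text gains a space wherever one was missing.

```python
import re

# Hypothetical, simplified pattern for an RFC 2047 encoded word:
# =?charset?encoding?text?= with B or Q encoding, matched only when the
# preceding character is not a space or tab.
ENCODED_WORD = re.compile(r'(?<=[^ \t])(=\?[^?\s]+\?[bBqQ]\?[^?\s]*\?=)')

def separate_encoded_words(value):
    """Insert a space before encoded words that directly follow other text."""
    return ENCODED_WORD.sub(r' \1', value)

fixed = separate_encoded_words('New:=?UTF-8?Q?t=C3=A9st?=;\trussian')
print(repr(fixed))
```

The result could then be passed to email._header_value_parser.get_unstructured in the same way as in the monkey patch.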
----------
components: email
files: 000359.raw
messages: 216908
nosy: barry, r.david.murray, valhallasw
priority: normal
severity: normal
status: open
title: email._header_value_parser does not recognise in-line encoding changes
versions: Python 3.3, Python 3.4, Python 3.5
Added file: http://bugs.python.org/file34984/000359.raw
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue21315>
_______________________________________