Sorry for the mojibake. I've not yet gotten around to actually using the email package to write a smarter replacement for nmh, which is what I use for email, and I always forget that I need to manually tell nmh when there is non-ASCII in the message...

On Wed, 17 Sep 2014 03:02:33 -0400, "R. David Murray" <rdmur...@bitdance.com> wrote:
> On Wed, 17 Sep 2014 14:42:56 +1000, Steven D'Aprano <st...@pearwood.info> wrote:
> > On Wed, Sep 17, 2014 at 11:14:15AM +1000, Chris Angelico wrote:
> > > On Wed, Sep 17, 2014 at 5:29 AM, R. David Murray <rdmur...@bitdance.com> wrote:
> > > >
> > > > Basically, we are pretending that each smuggled
> > > > byte is a single character for string parsing purposes...but they don't
> > > > match any of our parsing constants. They are all "any character" matches
> > > > in the regexes and what have you.
> > >
> > > This is slightly iffy, as you can't be sure that one byte represents
> > > one character, but as long as you don't much care about that, it's not
> > > going to be an issue.
> >
> > This discussion would probably be a lot easier to follow, with fewer
> > miscommunications, if there were some examples. Here is my example,
> > perhaps someone can tell me if I'm understanding it correctly.
> >
> > I want to send an email including the header line:
> >
> > 'Subject: “NOBODY expects the Spanish Inquisition!”'
> >
> > Note the curly quotes. I've read the manifesto "UTF-8 Everywhere" so I
> > do the right thing and encode it as UTF-8:
> >
> > b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'
>
> That won't work until email supports RFC 6532. Until then, you can only
> use ascii and encoded words successfully. So just having the curly
> quotes is a buggy enough program.
>
> > but it's not up to Python's email package to throw those invalid bytes
> > out or permanently replace them with something else. Also, we want to work
> > with Unicode strings, not byte strings, so there has to be a way to
> > smuggle those three bytes into Unicode, without ending up with either
> > the replacement bytes:
> >
> > # using the 'replace' error handler
> > 'Subject: ���NOBODY expects the Spanish Inquisition!”'
>
> What you'll get if you request a text copy of that header is
>
> 'Subject: ���NOBODY expects the Spanish Inquisition!���'
>
> > Am I right so far?
> >
> > So the email package uses the surrogate-escape error handler and ends up
> > with this Unicode string:
> >
> > 'Subject: \udc9c\udc80\udce2NOBODY expects the Spanish Inquisition!”'
>
> Except that it encodes the closing quote, too :)
>
> > which can be encoded back to the bytes we started with.
>
> Right. If you serialize the message as bytes, the bytes are recovered
> and output when that header is output.
>
> Now, once we support RFC 6532, you will be exactly right, as we will
> then have the option of handling utf-8 encoded headers, and we will do
> that using the utf-8 codec to ingest headers, and the surrogateescape
> error handler to handle exactly the kind of bad data you postulate.
>
> --David
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe:
> https://mail.python.org/mailman/options/python-dev/rdmurray%40bitdance.com
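
For anyone who wants to try the round trip discussed above, here is a minimal sketch using only the built-in codecs. It assumes the header bytes from the example in the thread and decodes them as ASCII with the surrogateescape handler; it illustrates the codec mechanism only and is not a copy of what the email package does internally. The variable names are made up for illustration.

```python
# Header bytes from the example in the thread: UTF-8 encoded curly quotes.
raw = b'Subject: \xe2\x80\x9cNOBODY expects the Spanish Inquisition!\xe2\x80\x9d'

# With surrogateescape, each undecodable byte 0xNN becomes the lone
# surrogate U+DCNN, so the six non-ASCII bytes turn into six code points,
# in the same order as the source bytes.
smuggled = raw.decode('ascii', errors='surrogateescape')

# The 'replace' handler is lossy: each bad byte becomes U+FFFD and the
# original byte value cannot be recovered afterwards.
replaced = raw.decode('ascii', errors='replace')

# Encoding with the same handler turns the surrogates back into the
# original bytes.
roundtrip = smuggled.encode('ascii', errors='surrogateescape')

print(ascii(smuggled))
print(ascii(replaced))
print(roundtrip == raw)
```

Note that both the opening and the closing quote are escaped, three surrogates each, and that the string built with 'replace' can no longer be turned back into the original bytes.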