On Thu, 2016-03-03 at 18:55 +0100, Julien Cristau wrote: > # HG changeset patch > # User Julien Cristau <julien.cris...@logilab.fr> > # Date 1457026459 -3600 > # Thu Mar 03 18:34:19 2016 +0100 > # Node ID 6c153cbad4a032861417dbba9d1d90332964ab5f > # Parent 549ff28a345f595cad7e06fb08c2ac6973e2f030 > patch: when importing from email, RFC2047-decode From/Subject headers > > I'm not too sure about the Subject part: it should be possible to use > the charset information from the email (RFC2047 encoding and the > Content-Type header), but mercurial seems to use its own encoding > instead (in the test, that means the commit message ends up as "????" > if the import is done without --encoding utf-8). Advice welcome. > > Reported at https://bugs.debian.org/737498
You should probably immediately relay such reports upstream. > diff --git a/mercurial/patch.py b/mercurial/patch.py > --- a/mercurial/patch.py > +++ b/mercurial/patch.py > @@ -201,19 +201,28 @@ def extract(ui, fileobj): > # (this heuristic is borrowed from quilt) > diffre = re.compile(r'^(?:Index:[ \t]|diff[ \t]|RCS file: |' > r'retrieving revision [0-9]+(\.[0-9]+)*$|' > r'---[ \t].*?^\+\+\+[ \t]|' > r'\*\*\*[ \t].*?^---[ \t])', re.MULTILINE|re.DOTALL) > + def decode_header(header): FYI, names with underbars are against our coding convention, contrib/check- commit ought to warn about this. > + if header is None: > + return None > + parts = [] > + for part, charset in email.Header.decode_header(header): > + if charset is None: > + charset = 'ascii' This will almost certainly explode on some emails. We should probably do something like this: - attempt to decode based on header garbage - attempt to decode with UTF-8 - assume Latin-1 (not ascii) > + parts.append(part.decode(charset)) > + return encoding.tolocal(u' '.join(parts).encode('utf-8')) Using Unicode objects outside of encoding.py is strongly discouraged. If you must, it'd be great to unambiguously mark them all with a leading u on the variable name. This isn't a good fit for encoding.py since it uses a third encoding besides UTF-8 and local. Probably belongs in mail.py. -- Mathematics is the supreme nostalgia of our time.