Bug#992462: UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 32: invalid continuation byte

Michael Grant Wed, 18 Aug 2021 18:57:25 -0700

Package: libpython3.9-minimal
Version: 3.9.2-1
Severity: important

Dear Maintainer,


While running getmail, which calls this library, to download my spam
folder from a Gmail account for further processing, I ran into an error
in header.py.  It is triggered when a message contains an invalid UTF-8
sequence.  For example:

b'Body Revolution - Medico Postura\xe2"\xa2 Body Posture Corrector'

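Note the double quote (") in the middle of what should be a multi-byte
UTF-8 sequence: 0xe2 starts a three-byte sequence, but the next byte is
0x22, which is not a valid continuation byte.  This triggers the
following exception: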

Exception: please read docs/BUGS and include the following information
in any bug report:

  getmail version 6.14
    Python version 3.9.2 (default, Feb 28 2021, 17:03:44)
    [GCC 10.2.1 20210110]

Unhandled exception follows:
    File "/usr/bin/getmail", line 932, in main
        success = go(configs, options.idle)
    File "/usr/bin/getmail", line 244, in go
        msg = mail_filter.filter_message(msg, retriever)
    File "/usr/lib/python3/dist-packages/getmailcore/filters.py", line 79, in filter_message
        exitcode, newmsg, err = self._filter_message(msg)
    File "/usr/lib/python3/dist-packages/getmailcore/filters.py", line 289, in _filter_message
        msg.add_header('X-getmail-filter-classifier', line)
    File "/usr/lib/python3/dist-packages/getmailcore/message.py", line 210, in add_header
        self.__msg[name] = Header(content.rstrip(), 'utf-8')
    File "/usr/lib/python3.9/email/header.py", line 217, in __init__
        self.append(s, charset, errors)
    File "/usr/lib/python3.9/email/header.py", line 295, in append
        s = s.decode(input_charset, errors)
    UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 32: invalid continuation byte

Please also include configuration information from running getmail
with your normal options plus "--dump".
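
The failure does not seem to depend on getmail itself. A minimal sketch
that should reproduce it with only the standard library, using the byte
string quoted above (the names RAW and header_error are made up for this
example):

```python
from email.header import Header

# The byte string from the report: 0xe2 is followed by 0x22 ("),
# which is not a valid UTF-8 continuation byte.
RAW = b'Body Revolution - Medico Postura\xe2"\xa2 Body Posture Corrector'


def header_error(raw, charset='utf-8'):
    """Return the UnicodeDecodeError raised by Header(), or None on success."""
    try:
        # Same call shape as getmailcore/message.py: bytes plus a charset name.
        Header(raw, charset)
    except UnicodeDecodeError as exc:
        return exc
    return None


err = header_error(RAW)
print(err)
```

The exception's start attribute gives the offset of the offending byte,
which matches the "position 32" in the traceback.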

The code in append() in /usr/lib/python3.9/email/header.py looks like this:

        if not isinstance(s, str):
            input_charset = charset.input_codec or 'us-ascii'
            if input_charset == _charset.UNKNOWN8BIT:
                s = s.decode('us-ascii', 'surrogateescape')
            else:
                s = s.decode(input_charset, errors)

I think you may need a try/except around that last s.decode() call, or
something similar, to catch the case where the input is invalid UTF-8.
I don't think this should fail like this.  If the bytes are not valid
UTF-8, it should probably fall back to latin-1.  I can't think of
anything better.
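
A rough sketch of the fallback I have in mind, written as caller-side
helpers rather than as a patch to header.py.  The names
decode_with_fallback and make_header are hypothetical and are not part
of the email package or getmail:

```python
from email.header import Header

RAW = b'Body Revolution - Medico Postura\xe2"\xa2 Body Posture Corrector'


def decode_with_fallback(raw, charset='utf-8'):
    """Decode raw bytes with charset, falling back to latin-1 on failure."""
    try:
        return raw.decode(charset)
    except UnicodeDecodeError:
        # latin-1 maps every byte value to a code point, so this cannot fail,
        # although the result may not be what the sender intended.
        return raw.decode('latin-1')


def make_header(raw, charset='utf-8'):
    """Build a Header from bytes without raising on invalid input."""
    # Header() does not decode str input, so the failing branch is skipped.
    return Header(decode_with_fallback(raw, charset), charset)


header = make_header(RAW)
print(str(header))
```

Another option that needs no new code is the existing errors argument,
for example Header(raw, 'utf-8', errors='replace'), which substitutes
U+FFFD for the bad bytes instead of guessing at latin-1.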

-- System Information:
Debian Release: 11.0
  APT prefers stable-security
  APT policy: (500, 'stable-security'), (500, 'stable'), (250, 'testing'), (10, 
'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)

Kernel: Linux 5.10.0-8-amd64 (SMP w/2 CPU threads)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL 
set to en_US.UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages libpython3.9-minimal depends on:
ii  libc6      2.31-13
ii  libssl1.1  1.1.1k-1

Versions of packages libpython3.9-minimal recommends:
ii  libpython3.9-stdlib  3.9.2-1

libpython3.9-minimal suggests no packages.

-- no debconf information
