New submission from Dylan Houlihan:
Currently, the `base64` function `b16decode` does not decode a hexadecimal
string containing lowercase characters by default. Doing so requires passing
`casefold=True` as the second argument. I propose changing `b16decode` so that
it accepts hexadecimal strings containing lowercase characters without
requiring the `casefold` argument.
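To illustrate the current behavior described above, here is a minimal sketch
against the stock `base64` module. The sample strings are arbitrary and chosen
only for illustration.

```python
import base64
import binascii

upper = "8DEF"
lower = "8def"

# Uppercase input decodes without any extra argument.
decoded_upper = base64.b16decode(upper)

# Lowercase input is rejected unless casefold=True is passed.
try:
    base64.b16decode(lower)
    lowercase_rejected = False
except binascii.Error:
    lowercase_rejected = True

# With casefold=True the same bytes come back.
decoded_lower = base64.b16decode(lower, casefold=True)
```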
The revision itself is straightforward. We simply have to amend the regular
expression in Lib/base64.py to match the lowercase characters a-f in addition
to A-F. Likewise, the corresponding tests in Lib/test/test_base64.py need to be
changed to account for the lack of a second argument. Therefore two files in
total need to be changed.
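The kind of change described above might look roughly like the following. This
is only a sketch, not the actual patch: `b16decode_relaxed` is a hypothetical
name, and the body is an assumption about how a validity check with a widened
character class could be written.

```python
import binascii
import re


def b16decode_relaxed(s):
    """Hypothetical b16decode variant that accepts a-f as well as A-F."""
    if isinstance(s, str):
        s = s.encode("ascii")
    # Widened character class: lowercase a-f is no longer treated as invalid.
    if re.search(b"[^0-9A-Fa-f]", s):
        raise binascii.Error("Non-base16 digit found")
    return binascii.unhexlify(s)
```

With this sketch, `b16decode_relaxed("8def")` and `b16decode_relaxed("8DEF")`
return the same bytes, while a string containing a non-hexadecimal character
still raises `binascii.Error`.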
In my view, there are several compelling reasons for this change:
1. There is a nontrivial performance improvement. I've already made the changes
on my own test branch[1] and I see a mean decoding performance improvement of
approximately 9.4% (averaged over 1,000,000 hexadecimal string decodings). The
testing details are included in a file attached to this issue.
2. Hexadecimal strings are case insensitive, i.e. 8DEF is equivalent to 8def.
This is the main reason I wrote the patch: there have been many times when I've
been momentarily confounded by a hexadecimal string that won't decode, only to
realize I'm yet again passing in a lowercase string.
3. The underlying function in `binascii`, `unhexlify`, accepts both uppercase
and lowercase characters by default without requiring a second parameter. From
the perspective of code hygiene and language consistency, I think it's both
more practical and more elegant for the language to behave in the same,
predictable manner (particularly because `base64.py` simply calls `binascii.c`
under the hood). Additionally, the `binascii` function `hexlify` outputs
lowercase hexadecimal strings, meaning that any application using both
`binascii` and `base64` has to pass `casefold=True` every time output from
`binascii.hexlify` is fed back as input to `base64.b16decode`.
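The inconsistency between the two modules mentioned in point 3 can be seen in a
short sketch using only the existing `binascii` and `base64` functions. The
sample bytes are arbitrary.

```python
import base64
import binascii

data = b"\x8d\xef"

hex_lower = binascii.hexlify(data)   # binascii emits lowercase digits
hex_upper = base64.b16encode(data)   # base64 emits uppercase digits

# unhexlify accepts either case with no extra argument.
round_trip_lower = binascii.unhexlify(hex_lower)
round_trip_upper = binascii.unhexlify(hex_upper)

# Feeding hexlify output to b16decode currently needs casefold=True.
round_trip_b16 = base64.b16decode(hex_lower, casefold=True)
```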
There are two arguments against this patch, as far as I can see:
1. In the relevant IETF reference documentation (RFC3548[2], referenced
directly in the `b16decode` docstring; and RFC4648[3], which supersedes it),
under Security Considerations the author Simon Josefsson claims that there
exists a potential side channel security issue intrinsic to accepting case
insensitive hexadecimal strings in a decoding function. While I'm not
dismissing this out of hand, I personally do not find the claimed vulnerability
compelling, and Josefsson does not clarify a real world attack scenario or
threat model. I think it's important we challenge this assumption in light of
the potential nontrivial improvements to both language consistency and
performance. I would be very interested in hearing a real threat model here
that would practically exist outside of a very contrived scenario. Moreover if
this is such a security issue, why is the behavior already evident in
`binascii.unhexlify`?
2. The other argument is that there may simply be no need for such a change.
One could argue that a developer won't frequently have to deal with this issue
because the inverse function, `b16encode`, produces hexadecimal strings with
uppercase characters. However, in my experience hexadecimal strings with
lowercase characters are extremely common in situations where developers
haven't produced the strings themselves in the language.
As I mentioned, I have already written the changes on my own patch branch. I'll
open a pull request once this issue has been created and reference the issue in
the pull request on GitHub.
References:
1. https://github.com/djhoulihan/cpython/tree/base64_case_sensitivity
2. https://tools.ietf.org/html/rfc3548
3. https://tools.ietf.org/html/rfc4648
--
components: Library (Lib)
files: testing_data.txt
messages: 332319
nosy: djhoulihan
priority: normal
severity: normal
status: open
title: Allow lowercase hexadecimal characters in base64.b16decode()
type: performance
versions: Python 3.4, Python 3.5, Python 3.6, Python 3.7, Python 3.8
Added file: https://bugs.python.org/file48013/testing_data.txt
___
Python tracker
<https://bugs.python.org/issue35557>
___