Mark Dickinson <[email protected]> added the comment:
> Why so?
Python's hash needs to obey the invariant that `hash(x) == hash(y)` for any two
hashable objects with `x == y`. That makes life particularly hard for numeric
types, because there are a number of *different* numeric types where equality
can hold across those types. This includes not just built-in types, but third
party types as well (think of NumPy, gmpy2, SymPy, and other libraries
providing numbers that need to compare equal to Python numbers with the same
value).
So for example, `hash(1.5)`, `hash(Decimal("1.5"))`, `hash(Fraction(3, 2))`,
`hash(1.5 + 0j)`, `hash(numpy.float32(1.5))`, `hash(bigfloat.BigFloat(1.5,
precision=200))` must _all_ be equal to one another within a single running
Python process.
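For instance, all of the following hold in CPython today (sticking to standard-library types here; the NumPy and bigfloat cases work the same way):

```python
from decimal import Decimal
from fractions import Fraction

# Four different representations of the value 1.5; equality across
# types forces their hashes to agree.
values = [1.5, Decimal("1.5"), Fraction(3, 2), 1.5 + 0j]
assert len({hash(v) for v in values}) == 1
```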
Moreover, hash computation needs to be efficient for common types like floats
and integers, while also not being impossibly slow for other types. (Early
versions of Decimal's hash did a check to see whether the Decimal instance was
an exact integer, and if so, converted that Decimal instance to an integer
before taking its hash. But doing that with `Decimal("1e999999")` doesn't go
well.)
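The way CPython sidesteps that trap is to define every numeric hash in terms of a single modular reduction: the hash of a non-negative rational p/q is p times the modular inverse of q, reduced modulo the prime 2**61 - 1 on 64-bit builds (see `sys.hash_info`). That lets Decimal hash a huge power of ten with one modular exponentiation instead of first materialising a million-digit integer. A rough illustration (64-bit CPython assumed):

```python
import sys
from decimal import Decimal

M = sys.hash_info.modulus  # 2**61 - 1 on 64-bit builds

# The hash of a positive int is just the value reduced modulo M ...
assert hash(10**999999) == pow(10, 999999, M)

# ... so Decimal can hash "1e999999" via modular arithmetic and still
# agree with the hash of the equal integer, as the invariant demands.
assert hash(Decimal("1e999999")) == hash(10**999999)
```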
It would definitely be *possible* to:
- compute a hash in a cross-type-compatible way
- do some sort of uniform post-processing of that hash, incorporating
information from a per-process random salt
The operations described by Melissa O'Neill in her PCG paper give ideas for
some ways to do such post-processing: regard the hash and the salt as together
forming a 128-bit integer, and then collapse that 128-bit integer down to a 64-bit
integer using one of the PCG post-processing methods. Note that as far as I
know there's been no validation of those methods from a cryptographic (rather
than a statistical) perspective.
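As a hypothetical sketch only (the function names and the pairing of salt and hash are mine, not a vetted design), the shape of PCG's XSL-RR output step applied to a 128-bit (salt, hash) state might look like this:

```python
MASK64 = (1 << 64) - 1

def rotr64(x, r):
    """Rotate a 64-bit value right by r bits."""
    r &= 63
    return ((x >> r) | (x << (64 - r))) & MASK64

def salted_hash(h, salt):
    # Treat the salt and the raw hash as the high and low 64-bit words
    # of a single 128-bit state, as described above.
    state = ((salt & MASK64) << 64) | (h & MASK64)
    rot = state >> 122                         # top 6 bits pick the rotation
    folded = ((state >> 64) ^ state) & MASK64  # xor-fold high word into low
    return rotr64(folded, rot)
```

Within one process the salt is fixed, so the result is deterministic; a different per-process salt scrambles all outputs. Again, this kind of post-processing has only statistical (not cryptographic) validation behind it.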
However, it would be significant work; it would be disruptive not just to
CPython but also to third-party packages and to other Python implementations;
it would slow down common hashing operations; and it would increase the amount
and the complexity of code that has to be maintained into the future.
So there's no shortage of reasons *not* to change the numeric hash. What I
think we're lacking is a single reason *to* change it. Can you give a plausible
example of a situation where the predictability of the numeric hash can lead to
possible security issues?
See also the recent issue #37807.
> but *not* if my keys, say, are tuples of strings
Bad example. :-) The hash of a tuple is based on the hash of its contents. So
if those contents are strings, the tuple benefits from the string hash
randomization.
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-824966383135019564
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
-5971473524922642515
mirzakhani:~ mdickinson$ python -c "print(hash(('abc', 'def')))"
5384650403450490974
----------
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue29535>
_______________________________________