Re: [Python-Dev] Counting collisions for the win
Hello, I'd still prefer to see a randomized hash()-function (at least for 3.3). But to protect against the attacks it would be sufficient to use randomization for collision resolution in dicts (and sets). What if we use a second (randomized) hash-function in case there are many collisions in ONE lookup. This hash-function is used only for collision resolution and is not cached. The benefits: * protection against the known attacks * hash(X) stays stable and the same * dict order is only changed when there are many collisions * doctests will not break * enhanced collision resolution * RNG doesn't have to be initialized in smaller programs * nearly no slowdown of most dicts * second hash-function is only used for keys with higher collision-rate * lower probability to leak secrets * possibility to use different secrets for each dict The drawback: * need to add a second hash-function * slower than using one hash-function only, when > 20 collisions * need to add this to container-types? (if used for py3.3) * need to expose this to the user? (if used for py3.3) * works only for datatypes with this new function * possible to implement without breaking ABI? The following code is meant for explanation purpose only: for(perturb = hash; ; perturb >>= 5) { i = (i << 2) + i + perturb + 1; if((collisions++) == 20) { // perturb is already zero after 13 rounds. // 20 collisions are rare. // you can add && (ma_mask > 256) to make 100% sure // that it's not used for smaller dicts. if(Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) { // If type has a randomized hash, use this now for lookup i = perturb = PyObject_RandomizedHash(key)); } . If I got this right we could add a new function "tp_randomized_hash" to 3.3 release. But can we also add functions in older releases, without breaking ABI? If not, can we implement this somehow using a flag? FOR OLDER RELEASE < 3.3: Py_hash_t PyObject_RandomizedHash(PyVarObject *o) { PyTypeObject *tp = Py_TYPE(v); if(! (tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH)) return -1; global_flags_somewhere->USE_RANDOMIZED_HASH = 1; return (*tp->tp_hash)(v); } and in unicodeobject.c: (and wherever we need randomization) static Py_hash_t unicode_hash(PyUnicodeObject *self) { Py_ssize_t len; Py_UNICODE *p; Py_hash_t x; Py_hash_t prefix=0; Py_hash_t suffix=0; if(global_flags_somewhere->USE_RANDOMIZED_HASH) { global_flags_somewhere->USE_RANDOMIZED_HASH = 0; initialize_rng_if_not_already_done_and_return_seed(&prefix, &suffix); . (and don't cache in this case) . It's ugly, but if I understand this correctly, the GIL will protect us against race-conditions, right? Hello, internals experts: Would this work or is there a better way to do this without breaking the ABI? Frank ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] exception chaining
Wiadomość napisana przez Ethan Furman w dniu 20 sty 2012, o godz. 21:05:The problem I have with 'raise x from None' is it puts 'from None' clear at the end of linefrom None raise SomeOtherError('etc.')Better yet:with nocontext(): raise SomeOtherError('etc.')But that's python-ideas territory ;)-- Best regards,Łukasz LangaSenior Systems Architecture EngineerIT Infrastructure DepartmentGrupa Allegro Sp. z o.o.Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość!Please consider the environment before printing out this e-mail.___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
On 1/23/2012 12:53 AM, Frank Sievertsen wrote: What if we use a second (randomized) hash-function in case there are many collisions in ONE lookup. This hash-function is used only for collision resolution and is not cached. So this sounds like SafeDict, but putting it under the covers and automatically converting from dict to SafeDict after a collision threshold has been reached. Let's call it fallback-dict. Compared to SafeDict as a programmer tool, fallback-dict has these benefits: * No need to change program (or library) source to respond to an attack * Order is preserved until the collision threshold has been reached * Performance is preserved until the collision threshold has been reached and costs: * converting the dict from one hash to the other by rehashing all the keys. Compared to always randomizing the hash, fallback-dict has these benefits: * hash (and perfomance) is deterministic: programs running on the same data set will have the same performance characteristic, unless the collision threshold is reached for that data set. * lower probability to leak secrets, because each attacked set/dict can have its own secret, randomized hash seed * patch would not need to include RNG initialization during startup, lowering the impact on short-running programs. What is not clear is how much SafeDict degrades performance when it is used; non-cached hashes will definitely have an impact. I'm not sure whether an implementation of fallback-dict in C code, would be significantly faster than the implementation of SafeDict in Python, to know whether comparing the performance of SafeDict and dict would be representative of the two stages of fallback-dict performance, but certainly the performance cost of SafeDict would be an upper bound on the performance cost of fallback-dict, once conversion takes place, but would not measure the conversion cost. The performance of fallback-dict does have to be significantly better than the performance of dict with collisions to be beneficial, but if the conversion cost is significant, triggering conversions could be an attack vector. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
On 23.01.2012 19:25, Glenn Linderman wrote: So this sounds like SafeDict, but putting it under the covers and automatically converting from dict to SafeDict after a collision threshold has been reached. Let's call it fallback-dict. and costs: * converting the dict from one hash to the other by rehashing all the keys. That's not exactly what it does, it calls the randomized hash-function only for those keys, that that didn't find a free slot after 20 collision. And it uses this value only for the further collision resolution. So the value of hash() is used for the first 20 slots, randomized_hash() is used after that. 1st try: slot[i = perturb = hash]; 2nd try: slot[i=i * 5 + 1 + (perturb >>= 5)] 3rd try: slot[i=i * 5 + 1 + (perturb >>= 5)] 20th try: slot[i= i * 5 + 1 + (perturb >>= 5)] 21th try: slot[i= perturb = randomized_hash(key)] < HERE 22th try: slot[i= i * 5 + 1 + (perturb >>= 5)] This is also why there is no conversion needed. It's a per-key/per-lookup rule. Frank ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] exception chaining
Am 23.01.2012 15:49, schrieb Łukasz Langa: [graphics] > Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość! > Please consider the environment before printing out this e-mail. Oh please?! Georg ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
On 1/23/2012 10:58 AM, Frank Sievertsen wrote: On 23.01.2012 19:25, Glenn Linderman wrote: So this sounds like SafeDict, but putting it under the covers and automatically converting from dict to SafeDict after a collision threshold has been reached. Let's call it fallback-dict. and costs: * converting the dict from one hash to the other by rehashing all the keys. That's not exactly what it does, it calls the randomized hash-function only for those keys, that that didn't find a free slot after 20 collision. And it uses this value only for the further collision resolution. So the value of hash() is used for the first 20 slots, randomized_hash() is used after that. 1st try: slot[i = perturb = hash]; 2nd try: slot[i=i * 5 + 1 + (perturb >>= 5)] 3rd try: slot[i=i * 5 + 1 + (perturb >>= 5)] 20th try: slot[i= i * 5 + 1 + (perturb >>= 5)] 21th try: slot[i= perturb = randomized_hash(key)] < HERE 22th try: slot[i= i * 5 + 1 + (perturb >>= 5)] This is also why there is no conversion needed. It's a per-key/per-lookup rule. Frank Interesting idea, and I see it would avoid conversions. What happens if the data area also removed from the hash? So you enter 20 colliding keys, then 20 more that get randomized, then delete the 18 of the first 20. Can you still find the second 20 keys? Takes two sets of probes, somehow? ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] exception chaining
Wiadomość napisana przez Georg Brandl w dniu 23 sty 2012, o godz. 21:18: > Am 23.01.2012 15:49, schrieb Łukasz Langa: > > [graphics] >> Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość! >> Please consider the environment before printing out this e-mail. > > Oh please?! Excuse me. Corpo speak! (at least it's short) -- Best regards, Łukasz Langa Senior Systems Architecture Engineer IT Infrastructure Department Grupa Allegro Sp. z o.o. ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
Frank Sievertsen wrote: > Hello, > > I'd still prefer to see a randomized hash()-function (at least for 3.3). > > But to protect against the attacks it would be sufficient to use > randomization for collision resolution in dicts (and sets). > > What if we use a second (randomized) hash-function in case there > are many collisions in ONE lookup. This hash-function is used only > for collision resolution and is not cached. This sounds a lot like what I'm referring to as universal hash function in the discussion on the ticket: http://bugs.python.org/issue13703#msg150724 http://bugs.python.org/issue13703#msg150795 http://bugs.python.org/issue13703#msg151813 However, I don't like the term "random" in there. It's better to make the approach deterministic to avoid issues with not being able to easily reproduce Python application runs for debugging purposes. If you find that the data is manipulated, simply incrementing the universal hash parameter and rehashing the dict with that parameter should be enough to solve the issue (if not, which is highly unlikely, the dict will simply reapply the fix). No randomness needed. BTW: I attached a demo script to the ticket which demonstrates both types of collisions using integers. -- Marc-Andre Lemburg eGenix.com Professional Python Services directly from the Source (#1, Jan 23 2012) >>> Python/Zope Consulting and Support ...http://www.egenix.com/ >>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/ >>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/ ::: Try our new mxODBC.Connect Python Database Interface for free ! eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48 D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg Registered at Amtsgericht Duesseldorf: HRB 46611 http://www.egenix.com/company/contact/ ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
Re: [Python-Dev] Counting collisions for the win
On 1/23/2012 1:25 PM, Glenn Linderman wrote: On 1/23/2012 12:53 AM, Frank Sievertsen wrote: What if we use a second (randomized) hash-function in case there are many collisions in ONE lookup. This hash-function is used only for collision resolution and is not cached. So this sounds like SafeDict, but putting it under the covers and automatically converting from dict to SafeDict after a collision threshold has been reached. Let's call it fallback-dict. If you're going to essentially switch data structures dynamically anyway, why not just switch to something that doesn't have n**2 worse case performance? Janzert ___ Python-Dev mailing list Python-Dev@python.org http://mail.python.org/mailman/listinfo/python-dev Unsubscribe: http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com