Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Frank Sievertsen

Hello,

I'd still prefer to see a randomized hash()-function (at least for 3.3).

But to protect against the attacks it would be sufficient to use
randomization for collision resolution in dicts (and sets).

What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.

The benefits:

* protection against the known attacks
* hash(X) stays stable and the same
* dict order is only changed when there are many collisions
* doctests will not break
* enhanced collision resolution
* RNG doesn't have to be initialized in smaller programs
* nearly no slowdown of most dicts
* second hash-function is only used for keys with higher collision-rate
* lower probability to leak secrets
* possibility to use different secrets for each dict

The drawback:

* need to add a second hash-function
* slower than using one hash-function only, when > 20 collisions
* need to add this to container-types? (if used for py3.3)
* need to expose this to the user? (if used for py3.3)
* works only for datatypes with this new function
* possible to implement without breaking ABI?

The following code is meant for explanation purpose only:

for(perturb = hash; ; perturb >>= 5) {
i = (i << 2) + i + perturb + 1;

if((collisions++) == 20) {
// perturb is already zero after 13 rounds.
// 20 collisions are rare.

// you can add && (ma_mask > 256) to make 100% sure
// that it's not used for smaller dicts.

if(Py_TYPE(key)->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH) {
// If type has a randomized hash, use this now for lookup
i = perturb = PyObject_RandomizedHash(key));
}
   .


If I got this right we could add a new function "tp_randomized_hash"
to 3.3 release. But can we also add functions in older releases, without
breaking ABI?

If not, can we implement this somehow using a flag?

FOR OLDER RELEASE < 3.3:

Py_hash_t
PyObject_RandomizedHash(PyVarObject *o) {
PyTypeObject *tp = Py_TYPE(v);
if(! (tp->tp_flags & Py_TPFLAGS_HAVE_RANDOMIZED_HASH))
return -1;

global_flags_somewhere->USE_RANDOMIZED_HASH = 1;
return (*tp->tp_hash)(v);
}

 and in unicodeobject.c: (and wherever we need randomization)

static Py_hash_t
unicode_hash(PyUnicodeObject *self)
{
Py_ssize_t len;
Py_UNICODE *p;
Py_hash_t x;
Py_hash_t prefix=0;
Py_hash_t suffix=0;

if(global_flags_somewhere->USE_RANDOMIZED_HASH) {
global_flags_somewhere->USE_RANDOMIZED_HASH = 0;
initialize_rng_if_not_already_done_and_return_seed(&prefix, 
&suffix);


. (and don't cache in this case) .


It's ugly, but if I understand this correctly, the GIL will
protect us against race-conditions, right?

Hello, internals experts: Would this work or is there a better way to do
this without breaking the ABI?

Frank
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] exception chaining

2012-01-23 Thread Łukasz Langa
Wiadomość napisana przez Ethan Furman w dniu 20 sty 2012, o godz. 21:05:The problem I have with 'raise x from None' is it puts 'from None' clear at the end of linefrom None raise SomeOtherError('etc.')Better yet:with nocontext():	raise SomeOtherError('etc.')But that's python-ideas territory ;)-- Best regards,Łukasz LangaSenior Systems Architecture EngineerIT Infrastructure DepartmentGrupa Allegro Sp. z o.o.Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość!Please consider the environment before printing out this e-mail.___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Glenn Linderman

On 1/23/2012 12:53 AM, Frank Sievertsen wrote:


What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached. 


So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


Compared to SafeDict as a programmer tool, fallback-dict has these benefits:

* No need to change program (or library) source to respond to an attack
* Order is preserved until the collision threshold has been reached
* Performance is preserved until the collision threshold has been reached

and costs:

* converting the dict from one hash to the other by rehashing all the keys.

Compared to always randomizing the hash, fallback-dict has these benefits:

* hash (and perfomance) is deterministic: programs running on the same 
data set will have the same performance characteristic, unless the 
collision threshold is reached for that data set.
* lower probability to leak secrets, because each attacked set/dict can 
have its own secret, randomized hash seed
* patch would not need to include RNG initialization during startup, 
lowering the impact on short-running programs.


What is not clear is how much SafeDict degrades performance when it is 
used; non-cached hashes will definitely have an impact.  I'm not sure 
whether an implementation of fallback-dict in C code, would be 
significantly faster than the implementation of SafeDict in Python, to 
know whether comparing the performance of SafeDict and dict would be 
representative of the two stages of fallback-dict performance, but 
certainly the performance cost of SafeDict would be an upper bound on 
the performance cost of fallback-dict, once conversion takes place, but 
would not measure the conversion cost.  The performance of fallback-dict 
does have to be significantly better than the performance of dict with 
collisions to be beneficial, but if the conversion cost is significant, 
triggering conversions could be an attack vector.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Frank Sievertsen



On 23.01.2012 19:25, Glenn Linderman wrote:
So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


and costs:

* converting the dict from one hash to the other by rehashing all the 
keys.


That's not exactly what it does, it calls the randomized hash-function 
only for those
keys, that that didn't find a free slot after 20 collision. And it uses 
this value only for

the further collision resolution.

So the value of hash() is used for the first 20 slots, randomized_hash() 
is used

after that.

1st try: slot[i = perturb = hash];
2nd try: slot[i=i * 5 + 1 + (perturb >>= 5)]
3rd try: slot[i=i * 5 + 1 + (perturb >>= 5)]

20th try: slot[i= i * 5 + 1 + (perturb >>= 5)]
21th try: slot[i= perturb = randomized_hash(key)] < HERE
22th try: slot[i= i * 5 + 1 + (perturb >>= 5)]

This is also why there is no conversion needed. It's a
per-key/per-lookup rule.

Frank
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] exception chaining

2012-01-23 Thread Georg Brandl
Am 23.01.2012 15:49, schrieb Łukasz Langa:

[graphics]
> Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość!
> Please consider the environment before printing out this e-mail.

Oh please?!

Georg

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Glenn Linderman

On 1/23/2012 10:58 AM, Frank Sievertsen wrote:



On 23.01.2012 19:25, Glenn Linderman wrote:
So this sounds like SafeDict, but putting it under the covers and 
automatically converting from dict to SafeDict after a collision 
threshold has been reached.  Let's call it fallback-dict.


and costs:

* converting the dict from one hash to the other by rehashing all the 
keys.


That's not exactly what it does, it calls the randomized hash-function 
only for those
keys, that that didn't find a free slot after 20 collision. And it 
uses this value only for

the further collision resolution.

So the value of hash() is used for the first 20 slots, 
randomized_hash() is used

after that.

1st try: slot[i = perturb = hash];
2nd try: slot[i=i * 5 + 1 + (perturb >>= 5)]
3rd try: slot[i=i * 5 + 1 + (perturb >>= 5)]

20th try: slot[i= i * 5 + 1 + (perturb >>= 5)]
21th try: slot[i= perturb = randomized_hash(key)] < HERE
22th try: slot[i= i * 5 + 1 + (perturb >>= 5)]

This is also why there is no conversion needed. It's a
per-key/per-lookup rule.

Frank


Interesting idea, and I see it would avoid conversions.  What happens if 
the data area also removed from the hash?  So you enter 20 colliding 
keys, then 20 more that get randomized, then delete the 18 of the first 
20.  Can you still find the second 20 keys? Takes two sets of probes, 
somehow?
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] exception chaining

2012-01-23 Thread Łukasz Langa

Wiadomość napisana przez Georg Brandl w dniu 23 sty 2012, o godz. 21:18:

> Am 23.01.2012 15:49, schrieb Łukasz Langa:
> 
> [graphics]
>> Pomyśl o środowisku naturalnym zanim wydrukujesz tę wiadomość!
>> Please consider the environment before printing out this e-mail.
> 
> Oh please?!

Excuse me. Corpo speak! (at least it's short)

-- 
Best regards,
Łukasz Langa
Senior Systems Architecture Engineer

IT Infrastructure Department
Grupa Allegro Sp. z o.o.
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread M.-A. Lemburg
Frank Sievertsen wrote:
> Hello,
> 
> I'd still prefer to see a randomized hash()-function (at least for 3.3).
> 
> But to protect against the attacks it would be sufficient to use
> randomization for collision resolution in dicts (and sets).
> 
> What if we use a second (randomized) hash-function in case there
> are many collisions in ONE lookup. This hash-function is used only
> for collision resolution and is not cached.

This sounds a lot like what I'm referring to as universal hash function
in the discussion on the ticket:

http://bugs.python.org/issue13703#msg150724
http://bugs.python.org/issue13703#msg150795
http://bugs.python.org/issue13703#msg151813

However, I don't like the term "random" in there. It's better to make
the approach deterministic to avoid issues with not being able
to easily reproduce Python application runs for debugging purposes.

If you find that the data is manipulated, simply incrementing the
universal hash parameter and rehashing the dict with that parameter
should be enough to solve the issue (if not, which is highly unlikely,
the dict will simply reapply the fix). No randomness needed.

BTW: I attached a demo script to the ticket which demonstrates both
types of collisions using integers.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 23 2012)
>>> Python/Zope Consulting and Support ...http://www.egenix.com/
>>> mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] Counting collisions for the win

2012-01-23 Thread Janzert

On 1/23/2012 1:25 PM, Glenn Linderman wrote:

On 1/23/2012 12:53 AM, Frank Sievertsen wrote:


What if we use a second (randomized) hash-function in case there
are many collisions in ONE lookup. This hash-function is used only
for collision resolution and is not cached.


So this sounds like SafeDict, but putting it under the covers and
automatically converting from dict to SafeDict after a collision
threshold has been reached.  Let's call it fallback-dict.



If you're going to essentially switch data structures dynamically 
anyway, why not just switch to something that doesn't have n**2 worse 
case performance?


Janzert

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com