date:20120610

Re: [Cython] Hash-based vtables

2012-06-10 Thread Robert Bradshaw

On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
 wrote:
> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>>
>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
>>> There's still the indirection through SEP 200 (extensibletype slots). We
>>> can
>>> get rid of that very easily by just making that table and the hash-vtable
>>> one and the same. (It could still either have interned string keys or ID
>>> keys depending on the least significant bit.)
>>
>>
>> Or we can even forgo the interning for this table, and give up on
>> partitioning the space numerically and just use the dns-style
>> prefixing, e.g. "org.cython.X" belongs to us.
>
>
> Huh? Isn't that when you *need* interning? Do you plan on key-encoding those
> kind of strings into 64 bits?

No, use 64-bits of a a cryptographically-secure hash.

> (I think it would usually be "method:foo:ii->d" (or my current preference is
> "method:foo:i4i8->f8"))

Yeah, I was assuming methods wouldn't be specific to Cython. (Putting
sizes in the format makes a lot of sense for persistent storage, but I
think it's safe to assume that a"long in the provider == a long in the
consumer, and this would mean the hashes would have to be computed
after (some) C compilation).

> Partitioning the space numerically you'd just hash the number; "SEP 260: We
> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".

But why bother with the id?

>> There is value in the double indirection if this (or any of the other)
>> lookup tables are meant to be modified over time.
>
>
> This isn't impossible with a hash table either. You just need to reallocate
> a little more often than what would be the case with a regular hash table,
> but not dramatically so (you need to rehash whenever the element to insert
> hashes into a "large" bin, which are rather few).
>
> I want the table to have a pointer to it, so that you can atomically swap it
> out.

I think that's worth a level of indirection.

>>> To wrap up, I think this has grown in complexity beyond the "simple SEP
>>> spec". It's at the point where you don't really want to have several
>>> libraries implementing the same simple spec, but instead use the same
>>> implementation.
>>>
>>> But I think the advantages are simply too good to give up on.
>>>
>>> So I think a viable route forward is to forget the
>>> CEP/SEP/pre-PEP-approach
>>> for now (which only works for semi-complicated ideas with simple
>>> implementations) and instead simply work more directly on a library. It
>>> would need to have a couple of different use modes:
>>
>>
>> I prefer an enhancement proposal with a spec over a library, even if
>> only a single library gets used in practice. I still think it's simple
>> enough. Basically, we have the "lookup spec" and then a CEP for
>> applying this to fast callable (agreeing on signatures, and what to do
>> with extern types) and extensible type slots.
>
>
> OK.
>
>
>>
>>>  - A Python perfect-hasher for use when generating code, with only the a
>>> string interner based on CPython dicts and extensibletype metaclass as
>>> runtime dependencies (for use in Cython). This would only add some
>>> hundred
>>> source file lines...
>>>
>>>  - A C implementation of the perfect hashing exposed through a
>>> PyPerfectHashTable_Ready(), for use in libraries written in C like
>>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
>>> implementation of the perfect hashing.
>>>
>>> And on the distribution axis:
>>>
>>>  - Small C header-style implementation of a string interner and the
>>> extensibletype metaclass, rendezvousing through sys.modules
>>>
>>>  - As part of the rendezvous, one would always try to __import__ the
>>> *real*
>>> run-time library. So if it is available in sys.path it overrides anything
>>> bundled with other libraries. That would provide a way forward for
>>> GIL-less
>>> string interning, a Python-side library for working with these tables and
>>> inspecting them, etc.
>>
>>
>> Hmm, that's an interesting idea. I think we don't actually need
>> interning, in which case the "full" library is only needed for
>> introspection.
>
>
> You don't believe the security concern is real then? Or do you want to pay
> the cost for a 160-bit SHA1 compare everywhere?
>
> I'd love to not do interning, but I see no way around it.

No, I want to use the lower 64 bits by default, but always have the
top 96 bits around to allow using this mechanism in "secure" mode at a
slight penalty. md5 is out because there are known collisions. (Yes,
sha-1 may succumb sooner rather than later, theoretical weaknesses
have been shown, so we could look to using something else (hopefully
still shipped with Python).

> BTW, a GIL-less interning library isn't rocket science. I ran khash.h
> through a preprocessor with
>
> KHASH_MAP_INIT_STR(str_to_entry, entry_t)
>
> and the result is 180 lines of code for the hash table. Then pythread.h
> provides the thread lock, another 50 lines for the interning logic
> (intern_literal, i

Re: [Cython] Hash-based vtables

2012-06-10 Thread Dag Sverre Seljebotn



Robert Bradshaw  wrote:

>On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
> wrote:
>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
>>>
>>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
 There's still the indirection through SEP 200 (extensibletype
>slots). We
 can
 get rid of that very easily by just making that table and the
>hash-vtable
 one and the same. (It could still either have interned string keys
>or ID
 keys depending on the least significant bit.)
>>>
>>>
>>> Or we can even forgo the interning for this table, and give up on
>>> partitioning the space numerically and just use the dns-style
>>> prefixing, e.g. "org.cython.X" belongs to us.
>>
>>
>> Huh? Isn't that when you *need* interning? Do you plan on
>key-encoding those
>> kind of strings into 64 bits?
>
>No, use 64-bits of a a cryptographically-secure hash.
>
>> (I think it would usually be "method:foo:ii->d" (or my current
>preference is
>> "method:foo:i4i8->f8"))
>
>Yeah, I was assuming methods wouldn't be specific to Cython. (Putting
>sizes in the format makes a lot of sense for persistent storage, but I
>think it's safe to assume that a"long in the provider == a long in the
>consumer, and this would mean the hashes would have to be computed
>after (some) C compilation).
>
>> Partitioning the space numerically you'd just hash the number; "SEP
>260: We
>> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
>
>But why bother with the id?
>
>>> There is value in the double indirection if this (or any of the
>other)
>>> lookup tables are meant to be modified over time.
>>
>>
>> This isn't impossible with a hash table either. You just need to
>reallocate
>> a little more often than what would be the case with a regular hash
>table,
>> but not dramatically so (you need to rehash whenever the element to
>insert
>> hashes into a "large" bin, which are rather few).
>>
>> I want the table to have a pointer to it, so that you can atomically
>swap it
>> out.
>
>I think that's worth a level of indirection.
>
 To wrap up, I think this has grown in complexity beyond the "simple
>SEP
 spec". It's at the point where you don't really want to have
>several
 libraries implementing the same simple spec, but instead use the
>same
 implementation.

 But I think the advantages are simply too good to give up on.

 So I think a viable route forward is to forget the
 CEP/SEP/pre-PEP-approach
 for now (which only works for semi-complicated ideas with simple
 implementations) and instead simply work more directly on a
>library. It
 would need to have a couple of different use modes:
>>>
>>>
>>> I prefer an enhancement proposal with a spec over a library, even if
>>> only a single library gets used in practice. I still think it's
>simple
>>> enough. Basically, we have the "lookup spec" and then a CEP for
>>> applying this to fast callable (agreeing on signatures, and what to
>do
>>> with extern types) and extensible type slots.
>>
>>
>> OK.
>>
>>
>>>
  - A Python perfect-hasher for use when generating code, with only
>the a
 string interner based on CPython dicts and extensibletype metaclass
>as
 runtime dependencies (for use in Cython). This would only add some
 hundred
 source file lines...

  - A C implementation of the perfect hashing exposed through a
 PyPerfectHashTable_Ready(), for use in libraries written in C like
 NumPy/SciPy). This would need to bundle the md5 algorithm and a C
 implementation of the perfect hashing.

 And on the distribution axis:

  - Small C header-style implementation of a string interner and the
 extensibletype metaclass, rendezvousing through sys.modules

  - As part of the rendezvous, one would always try to __import__
>the
 *real*
 run-time library. So if it is available in sys.path it overrides
>anything
 bundled with other libraries. That would provide a way forward for
 GIL-less
 string interning, a Python-side library for working with these
>tables and
 inspecting them, etc.
>>>
>>>
>>> Hmm, that's an interesting idea. I think we don't actually need
>>> interning, in which case the "full" library is only needed for
>>> introspection.
>>
>>
>> You don't believe the security concern is real then? Or do you want
>to pay
>> the cost for a 160-bit SHA1 compare everywhere?
>>
>> I'd love to not do interning, but I see no way around it.
>
>No, I want to use the lower 64 bits by default, but always have the
>top 96 bits around to allow using this mechanism in "secure" mode at a
>slight penalty. md5 is out because there are known collisions. (Yes,
>sha-1 may succumb sooner rather than later, theoretical weaknesses
>have been shown, so we could look to using something else (hopefully
>still shipped with Python).

But very few users are going to know about this. What's the odds that the user 
who decide to trigger JIT-compilation with function signatures that

Re: [Cython] Hash-based vtables

2012-06-10 Thread Robert Bradshaw

On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
 wrote:
>
>
> Robert Bradshaw  wrote:
>
>>On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
>> wrote:
>>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote:

 On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn
> There's still the indirection through SEP 200 (extensibletype
>>slots). We
> can
> get rid of that very easily by just making that table and the
>>hash-vtable
> one and the same. (It could still either have interned string keys
>>or ID
> keys depending on the least significant bit.)


 Or we can even forgo the interning for this table, and give up on
 partitioning the space numerically and just use the dns-style
 prefixing, e.g. "org.cython.X" belongs to us.
>>>
>>>
>>> Huh? Isn't that when you *need* interning? Do you plan on
>>key-encoding those
>>> kind of strings into 64 bits?
>>
>>No, use 64-bits of a a cryptographically-secure hash.
>>
>>> (I think it would usually be "method:foo:ii->d" (or my current
>>preference is
>>> "method:foo:i4i8->f8"))
>>
>>Yeah, I was assuming methods wouldn't be specific to Cython. (Putting
>>sizes in the format makes a lot of sense for persistent storage, but I
>>think it's safe to assume that a"long in the provider == a long in the
>>consumer, and this would mean the hashes would have to be computed
>>after (some) C compilation).
>>
>>> Partitioning the space numerically you'd just hash the number; "SEP
>>260: We
>>> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".
>>
>>But why bother with the id?
>>
 There is value in the double indirection if this (or any of the
>>other)
 lookup tables are meant to be modified over time.
>>>
>>>
>>> This isn't impossible with a hash table either. You just need to
>>reallocate
>>> a little more often than what would be the case with a regular hash
>>table,
>>> but not dramatically so (you need to rehash whenever the element to
>>insert
>>> hashes into a "large" bin, which are rather few).
>>>
>>> I want the table to have a pointer to it, so that you can atomically
>>swap it
>>> out.
>>
>>I think that's worth a level of indirection.
>>
> To wrap up, I think this has grown in complexity beyond the "simple
>>SEP
> spec". It's at the point where you don't really want to have
>>several
> libraries implementing the same simple spec, but instead use the
>>same
> implementation.
>
> But I think the advantages are simply too good to give up on.
>
> So I think a viable route forward is to forget the
> CEP/SEP/pre-PEP-approach
> for now (which only works for semi-complicated ideas with simple
> implementations) and instead simply work more directly on a
>>library. It
> would need to have a couple of different use modes:


 I prefer an enhancement proposal with a spec over a library, even if
 only a single library gets used in practice. I still think it's
>>simple
 enough. Basically, we have the "lookup spec" and then a CEP for
 applying this to fast callable (agreeing on signatures, and what to
>>do
 with extern types) and extensible type slots.
>>>
>>>
>>> OK.
>>>
>>>

>  - A Python perfect-hasher for use when generating code, with only
>>the a
> string interner based on CPython dicts and extensibletype metaclass
>>as
> runtime dependencies (for use in Cython). This would only add some
> hundred
> source file lines...
>
>  - A C implementation of the perfect hashing exposed through a
> PyPerfectHashTable_Ready(), for use in libraries written in C like
> NumPy/SciPy). This would need to bundle the md5 algorithm and a C
> implementation of the perfect hashing.
>
> And on the distribution axis:
>
>  - Small C header-style implementation of a string interner and the
> extensibletype metaclass, rendezvousing through sys.modules
>
>  - As part of the rendezvous, one would always try to __import__
>>the
> *real*
> run-time library. So if it is available in sys.path it overrides
>>anything
> bundled with other libraries. That would provide a way forward for
> GIL-less
> string interning, a Python-side library for working with these
>>tables and
> inspecting them, etc.


 Hmm, that's an interesting idea. I think we don't actually need
 interning, in which case the "full" library is only needed for
 introspection.
>>>
>>>
>>> You don't believe the security concern is real then? Or do you want
>>to pay
>>> the cost for a 160-bit SHA1 compare everywhere?
>>>
>>> I'd love to not do interning, but I see no way around it.
>>
>>No, I want to use the lower 64 bits by default, but always have the
>>top 96 bits around to allow using this mechanism in "secure" mode at a
>>slight penalty. md5 is out because there are known collisions. (Yes,
>>sha-1 may succumb sooner rather than later, theoretical weaknesses
>>have been shown, so we could look to usi

Re: [Cython] Hash-based vtables

2012-06-10 Thread Dag Sverre Seljebotn


On 06/10/2012 09:34 AM, Robert Bradshaw wrote:

On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
  wrote:



Robert Bradshaw  wrote:


On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
  wrote:

I'd love to not do interning, but I see no way around it.


No, I want to use the lower 64 bits by default, but always have the
top 96 bits around to allow using this mechanism in "secure" mode at a
slight penalty. md5 is out because there are known collisions. (Yes,
sha-1 may succumb sooner rather than later, theoretical weaknesses
have been shown, so we could look to using something else (hopefully
still shipped with Python).


But very few users are going to know about this. What's the odds that the user 
who decide to trigger JIT-compilation with function signatures that varies 
based on the input will know about the option and turn it on and also recompile 
all his/her C extension modules?

In practice, such an option would always stay at its default value. If we leave 
it to secure by default and start teaching it to users from the start...but 
that's a big price to pay.


Yes, it's not ideal from this perspective.


And if you *do* want to run in secure mode, it will be a lot slower than 
interning.


Are you thinking that the 64-bit interned pointer would be used as the
hash? In this case all hashtables would have to be constructed at
runtime, which means it needs to be really, really cheap (well under a
milisecond, I'm sure Sage has>1000 classes,>1 methods it imports
at startup). Also I'm not sure how the very-uneven distribution would
play out for constructing perfect hastables (perhaps it won't hurt,
there's likely to be long runs of consecutive values in some cases.


No, I'm thinking that callsites need both the 64-bit interned char* and 
the 64-bit hash of the *contents*. They use the hash to figure out the 
position, then compare by ID.


The hash is not stored in callees, it's discarded after figuring out the 
table layout.


(There was this idea that if the char* has least significant bit set, 
we'd hash it directly rather than dereference it, but let's ignore that 
for now.)


I don't think under a millisecond is unfeasible to hash smallish tables 
-- we could put the pointer through a cheap hash to create more entropy 
(for the perfect hashing, being able to select a hash function through 
the >>r is important, so you can't just use the pointer directly -- but 
there are functions cheaper than md5, e.g, in here: 
http://code.google.com/p/ulib/)


That would save us a register and make the instructions shorter in some 
places I guess...I think it's really miniscule, it's not like the effect 
of load of a global variable. But if you like this approach I can 
benchmark C-written hashtable creation and see.


Dag
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

Re: [Cython] Hash-based vtables

2012-06-10 Thread Robert Bradshaw

On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn
 wrote:
> On 06/10/2012 09:34 AM, Robert Bradshaw wrote:
>>
>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
>>   wrote:
>>>
>>>
>>>
>>> Robert Bradshaw  wrote:
>>>
 On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
   wrote:
>
> I'd love to not do interning, but I see no way around it.


 No, I want to use the lower 64 bits by default, but always have the
 top 96 bits around to allow using this mechanism in "secure" mode at a
 slight penalty. md5 is out because there are known collisions. (Yes,
 sha-1 may succumb sooner rather than later, theoretical weaknesses
 have been shown, so we could look to using something else (hopefully
 still shipped with Python).
>>>
>>>
>>> But very few users are going to know about this. What's the odds that the
>>> user who decide to trigger JIT-compilation with function signatures that
>>> varies based on the input will know about the option and turn it on and also
>>> recompile all his/her C extension modules?
>>>
>>> In practice, such an option would always stay at its default value. If we
>>> leave it to secure by default and start teaching it to users from the
>>> start...but that's a big price to pay.
>>
>>
>> Yes, it's not ideal from this perspective.
>>
>>> And if you *do* want to run in secure mode, it will be a lot slower than
>>> interning.
>>
>>
>> Are you thinking that the 64-bit interned pointer would be used as the
>> hash? In this case all hashtables would have to be constructed at
>> runtime, which means it needs to be really, really cheap (well under a
>> milisecond, I'm sure Sage has>1000 classes,>1 methods it imports
>> at startup). Also I'm not sure how the very-uneven distribution would
>> play out for constructing perfect hastables (perhaps it won't hurt,
>> there's likely to be long runs of consecutive values in some cases.
>
>
> No, I'm thinking that callsites need both the 64-bit interned char* and the
> 64-bit hash of the *contents*. They use the hash to figure out the position,
> then compare by ID.

Ah, I missed that bit. OK, yes, that could work well.

> The hash is not stored in callees, it's discarded after figuring out the
> table layout.
>
> (There was this idea that if the char* has least significant bit set, we'd
> hash it directly rather than dereference it, but let's ignore that for now.)

(For the purpose of this discussion, it's part of the "interning" step.)

> I don't think under a millisecond is unfeasible to hash smallish tables --
> we could put the pointer through a cheap hash to create more entropy (for
> the perfect hashing, being able to select a hash function through the >>r is
> important, so you can't just use the pointer directly -- but there are
> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/)

Just a sec, we're not hashing pointers, but the full signature itself,
right? For our hash function we need

(1) Collision free on 64-bits (for non-malicious use).
(2) Good distribution (including for short strings, which is harder to come by).
(2b) Any small subset of bits should have property (2).
(3) Ideally easy to reference (e.g. "md5" is better than "these 100
lines of C code").

Cheap runtime construction is still ideal, but much less of an issue
if hashes (and perfect tables) can be constructed at compile time,
which I think this scheme allows.

> That would save us a register and make the instructions shorter in some
> places I guess...I think it's really miniscule, it's not like the effect of
> load of a global variable. But if you like this approach I can benchmark
> C-written hashtable creation and see.

This will have value in and of itself (both the implementation and the
benchmarks).

- Robert
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

Re: [Cython] Hash-based vtables

2012-06-10 Thread Dag Sverre Seljebotn



Dag Sverre Seljebotn  wrote:

>On 06/10/2012 09:34 AM, Robert Bradshaw wrote:
>> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
>>   wrote:
>>>
>>>
>>> Robert Bradshaw  wrote:
>>>
 On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
   wrote:
> I'd love to not do interning, but I see no way around it.

 No, I want to use the lower 64 bits by default, but always have the
 top 96 bits around to allow using this mechanism in "secure" mode
>at a
 slight penalty. md5 is out because there are known collisions.
>(Yes,
 sha-1 may succumb sooner rather than later, theoretical weaknesses
 have been shown, so we could look to using something else
>(hopefully
 still shipped with Python).
>>>
>>> But very few users are going to know about this. What's the odds
>that the user who decide to trigger JIT-compilation with function
>signatures that varies based on the input will know about the option
>and turn it on and also recompile all his/her C extension modules?
>>>
>>> In practice, such an option would always stay at its default value.
>If we leave it to secure by default and start teaching it to users from
>the start...but that's a big price to pay.
>>
>> Yes, it's not ideal from this perspective.
>>
>>> And if you *do* want to run in secure mode, it will be a lot slower
>than interning.
>>
>> Are you thinking that the 64-bit interned pointer would be used as
>the
>> hash? In this case all hashtables would have to be constructed at
>> runtime, which means it needs to be really, really cheap (well under
>a
>> milisecond, I'm sure Sage has>1000 classes,>1 methods it imports
>> at startup). Also I'm not sure how the very-uneven distribution would
>> play out for constructing perfect hastables (perhaps it won't hurt,
>> there's likely to be long runs of consecutive values in some cases.
>
>No, I'm thinking that callsites need both the 64-bit interned char* and
>
>the 64-bit hash of the *contents*. They use the hash to figure out the 
>position, then compare by ID.
>
>The hash is not stored in callees, it's discarded after figuring out
>the 
>table layout.
>
>(There was this idea that if the char* has least significant bit set, 
>we'd hash it directly rather than dereference it, but let's ignore that
>
>for now.)
>
>I don't think under a millisecond is unfeasible to hash smallish tables
>
>-- we could put the pointer through a cheap hash to create more entropy
>
>(for the perfect hashing, being able to select a hash function through 
>the >>r is important, so you can't just use the pointer directly -- but
>
>there are functions cheaper than md5, e.g, in here: 
>http://code.google.com/p/ulib/)
>
>That would save us a register and make the instructions shorter in some
>
>places I guess...I think it's really miniscule, it's not like the
>effect 
>of load of a global variable. But if you like this approach I can 
>benchmark C-written hashtable creation and see.


I don't know what I was thinking. Tha callsite can't hash every time, and the 
pointer doesn't contain enough entropy for perfect hashing, so hashing the 
pointer has only disadvantages.

I really think the call site should have both a hash and a separate interned 
ID. And if the caller knows the entry should be there, it can skip the ID check 
and only needs the hash.

That makes the table pretty slick for non-smart callers too, it would be (id, 
flags, ptr)-entries, and callers could either do strcmp or interning, with or 
without hashing. (I realize the information would be there in your proposal 
too, but this would be slimmer).

Dag



>
>Dag
>___
>cython-devel mailing list
>cython-devel@python.org
>http://mail.python.org/mailman/listinfo/cython-devel

-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

Re: [Cython] Hash-based vtables

2012-06-10 Thread Dag Sverre Seljebotn


On 06/10/2012 10:23 AM, Robert Bradshaw wrote:

On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn
  wrote:

On 06/10/2012 09:34 AM, Robert Bradshaw wrote:


On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
wrote:




Robert Bradshawwrote:


On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
wrote:


I'd love to not do interning, but I see no way around it.



No, I want to use the lower 64 bits by default, but always have the
top 96 bits around to allow using this mechanism in "secure" mode at a
slight penalty. md5 is out because there are known collisions. (Yes,
sha-1 may succumb sooner rather than later, theoretical weaknesses
have been shown, so we could look to using something else (hopefully
still shipped with Python).



But very few users are going to know about this. What's the odds that the
user who decide to trigger JIT-compilation with function signatures that
varies based on the input will know about the option and turn it on and also
recompile all his/her C extension modules?

In practice, such an option would always stay at its default value. If we
leave it to secure by default and start teaching it to users from the
start...but that's a big price to pay.



Yes, it's not ideal from this perspective.


And if you *do* want to run in secure mode, it will be a lot slower than
interning.



Are you thinking that the 64-bit interned pointer would be used as the
hash? In this case all hashtables would have to be constructed at
runtime, which means it needs to be really, really cheap (well under a
milisecond, I'm sure Sage has>1000 classes,>1 methods it imports
at startup). Also I'm not sure how the very-uneven distribution would
play out for constructing perfect hastables (perhaps it won't hurt,
there's likely to be long runs of consecutive values in some cases.



No, I'm thinking that callsites need both the 64-bit interned char* and the
64-bit hash of the *contents*. They use the hash to figure out the position,
then compare by ID.


Ah, I missed that bit. OK, yes, that could work well.


Ah, we've been talking past one another for some time then. OK, let's 
settle on that.





The hash is not stored in callees, it's discarded after figuring out the
table layout.

(There was this idea that if the char* has least significant bit set, we'd
hash it directly rather than dereference it, but let's ignore that for now.)


(For the purpose of this discussion, it's part of the "interning" step.)


I don't think under a millisecond is unfeasible to hash smallish tables --
we could put the pointer through a cheap hash to create more entropy (for
the perfect hashing, being able to select a hash function through the>>r is
important, so you can't just use the pointer directly -- but there are
functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/)


Just a sec, we're not hashing pointers, but the full signature itself,
right? For our hash function we need

(1) Collision free on 64-bits (for non-malicious use).
(2) Good distribution (including for short strings, which is harder to come by).
(2b) Any small subset of bits should have property (2).
(3) Ideally easy to reference (e.g. "md5" is better than "these 100
lines of C code").

Cheap runtime construction is still ideal, but much less of an issue
if hashes (and perfect tables) can be constructed at compile time,
which I think this scheme allows.


Yes, 64 bits of md5 then? ulib contains "100 lines of C code" for md5 
anyway, if one doesn't want to go through Python hashlib (I imagine e.g. 
hashlib might be unavailable somewhere as it relies on openssl and 
there's license war going on vs. gnutls and so on. And the md5 module is 
deprecated.).





That would save us a register and make the instructions shorter in some
places I guess...I think it's really miniscule, it's not like the effect of
load of a global variable. But if you like this approach I can benchmark
C-written hashtable creation and see.


This will have value in and of itself (both the implementation and the
benchmarks).


Will do (eventually, less spare time in coming week).

About signatures, a problem I see with following the C typing is that 
the signature "ill" wouldn't hash the same as "iii" on 32-bit Windows 
and "iqq" on 32-bit Linux, and so on. I think that would be really bad. 
"l" must be banished -- but then one might as well do "i4i8i8".


Designing a signature hash where you select between these at 
compile-time is perhaps doable but does generate a lot of code and makes 
everything complicated. I think we should just start off with hashing at 
module load time when sizes are known, and then work with heuristics 
and/or build system integration to improve on that afterwards.



Dag
___
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel

Re: [Cython] Hash-based vtables

2012-06-10 Thread Robert Bradshaw

On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn
 wrote:
> On 06/10/2012 10:23 AM, Robert Bradshaw wrote:
>>
>> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn
>>   wrote:
>>>
>>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote:


 On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn
     wrote:
>
>
>
>
> Robert Bradshaw    wrote:
>
>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn
>>     wrote:
>>>
>>>
>>> I'd love to not do interning, but I see no way around it.
>>
>>
>>
>> No, I want to use the lower 64 bits by default, but always have the
>> top 96 bits around to allow using this mechanism in "secure" mode at a
>> slight penalty. md5 is out because there are known collisions. (Yes,
>> sha-1 may succumb sooner rather than later, theoretical weaknesses
>> have been shown, so we could look to using something else (hopefully
>> still shipped with Python).
>
>
>
> But very few users are going to know about this. What's the odds that
> the
> user who decide to trigger JIT-compilation with function signatures
> that
> varies based on the input will know about the option and turn it on and
> also
> recompile all his/her C extension modules?
>
> In practice, such an option would always stay at its default value. If
> we
> leave it to secure by default and start teaching it to users from the
> start...but that's a big price to pay.



 Yes, it's not ideal from this perspective.

> And if you *do* want to run in secure mode, it will be a lot slower
> than
> interning.



 Are you thinking that the 64-bit interned pointer would be used as the
 hash? In this case all hashtables would have to be constructed at
 runtime, which means it needs to be really, really cheap (well under a
 milisecond, I'm sure Sage has>1000 classes,>1 methods it imports
 at startup). Also I'm not sure how the very-uneven distribution would
 play out for constructing perfect hastables (perhaps it won't hurt,
 there's likely to be long runs of consecutive values in some cases.
>>>
>>>
>>>
>>> No, I'm thinking that callsites need both the 64-bit interned char* and
>>> the
>>> 64-bit hash of the *contents*. They use the hash to figure out the
>>> position,
>>> then compare by ID.
>>
>>
>> Ah, I missed that bit. OK, yes, that could work well.
>
>
> Ah, we've been talking past one another for some time then. OK, let's settle
> on that.
>
>
>>
>>> The hash is not stored in callees, it's discarded after figuring out the
>>> table layout.
>>>
>>> (There was this idea that if the char* has least significant bit set,
>>> we'd
>>> hash it directly rather than dereference it, but let's ignore that for
>>> now.)
>>
>>
>> (For the purpose of this discussion, it's part of the "interning" step.)
>>
>>> I don't think under a millisecond is unfeasible to hash smallish tables
>>> --
>>> we could put the pointer through a cheap hash to create more entropy (for
>>> the perfect hashing, being able to select a hash function through the>>r
>>> is
>>> important, so you can't just use the pointer directly -- but there are
>>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/)
>>
>>
>> Just a sec, we're not hashing pointers, but the full signature itself,
>> right? For our hash function we need
>>
>> (1) Collision free on 64-bits (for non-malicious use).
>> (2) Good distribution (including for short strings, which is harder to
>> come by).
>> (2b) Any small subset of bits should have property (2).
>> (3) Ideally easy to reference (e.g. "md5" is better than "these 100
>> lines of C code").
>>
>> Cheap runtime construction is still ideal, but much less of an issue
>> if hashes (and perfect tables) can be constructed at compile time,
>> which I think this scheme allows.
>
>
> Yes, 64 bits of md5 then?

+1 for me.

> ulib contains "100 lines of C code" for md5
> anyway, if one doesn't want to go through Python hashlib (I imagine e.g.
> hashlib might be unavailable somewhere as it relies on openssl and there's
> license war going on vs. gnutls and so on. And the md5 module is
> deprecated.).

Just the interface, right? (hashlib should be used instead...)

>>> That would save us a register and make the instructions shorter in some
>>> places I guess...I think it's really miniscule, it's not like the effect
>>> of
>>> load of a global variable. But if you like this approach I can benchmark
>>> C-written hashtable creation and see.
>>
>>
>> This will have value in and of itself (both the implementation and the
>> benchmarks).
>
>
> Will do (eventually, less spare time in coming week).
>
> About signatures, a problem I see with following the C typing is that the
> signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq"
> on 32-bit Linux, and so on. I think that would be really bad.

This i

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

Re: [Cython] Hash-based vtables

8 matches

Site Navigation

Mail list logo

Footer information