Re: [Cython] Hash-based vtables
On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: > On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >> >> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn >>> There's still the indirection through SEP 200 (extensibletype slots). We >>> can >>> get rid of that very easily by just making that table and the hash-vtable >>> one and the same. (It could still either have interned string keys or ID >>> keys depending on the least significant bit.) >> >> >> Or we can even forgo the interning for this table, and give up on >> partitioning the space numerically and just use the dns-style >> prefixing, e.g. "org.cython.X" belongs to us. > > > Huh? Isn't that when you *need* interning? Do you plan on key-encoding those > kind of strings into 64 bits? No, use 64-bits of a a cryptographically-secure hash. > (I think it would usually be "method:foo:ii->d" (or my current preference is > "method:foo:i4i8->f8")) Yeah, I was assuming methods wouldn't be specific to Cython. (Putting sizes in the format makes a lot of sense for persistent storage, but I think it's safe to assume that a"long in the provider == a long in the consumer, and this would mean the hashes would have to be computed after (some) C compilation). > Partitioning the space numerically you'd just hash the number; "SEP 260: We > use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". But why bother with the id? >> There is value in the double indirection if this (or any of the other) >> lookup tables are meant to be modified over time. > > > This isn't impossible with a hash table either. You just need to reallocate > a little more often than what would be the case with a regular hash table, > but not dramatically so (you need to rehash whenever the element to insert > hashes into a "large" bin, which are rather few). > > I want the table to have a pointer to it, so that you can atomically swap it > out. I think that's worth a level of indirection. >>> To wrap up, I think this has grown in complexity beyond the "simple SEP >>> spec". It's at the point where you don't really want to have several >>> libraries implementing the same simple spec, but instead use the same >>> implementation. >>> >>> But I think the advantages are simply too good to give up on. >>> >>> So I think a viable route forward is to forget the >>> CEP/SEP/pre-PEP-approach >>> for now (which only works for semi-complicated ideas with simple >>> implementations) and instead simply work more directly on a library. It >>> would need to have a couple of different use modes: >> >> >> I prefer an enhancement proposal with a spec over a library, even if >> only a single library gets used in practice. I still think it's simple >> enough. Basically, we have the "lookup spec" and then a CEP for >> applying this to fast callable (agreeing on signatures, and what to do >> with extern types) and extensible type slots. > > > OK. > > >> >>> - A Python perfect-hasher for use when generating code, with only the a >>> string interner based on CPython dicts and extensibletype metaclass as >>> runtime dependencies (for use in Cython). This would only add some >>> hundred >>> source file lines... >>> >>> - A C implementation of the perfect hashing exposed through a >>> PyPerfectHashTable_Ready(), for use in libraries written in C like >>> NumPy/SciPy). This would need to bundle the md5 algorithm and a C >>> implementation of the perfect hashing. >>> >>> And on the distribution axis: >>> >>> - Small C header-style implementation of a string interner and the >>> extensibletype metaclass, rendezvousing through sys.modules >>> >>> - As part of the rendezvous, one would always try to __import__ the >>> *real* >>> run-time library. So if it is available in sys.path it overrides anything >>> bundled with other libraries. That would provide a way forward for >>> GIL-less >>> string interning, a Python-side library for working with these tables and >>> inspecting them, etc. >> >> >> Hmm, that's an interesting idea. I think we don't actually need >> interning, in which case the "full" library is only needed for >> introspection. > > > You don't believe the security concern is real then? Or do you want to pay > the cost for a 160-bit SHA1 compare everywhere? > > I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode at a slight penalty. md5 is out because there are known collisions. (Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else (hopefully still shipped with Python). > BTW, a GIL-less interning library isn't rocket science. I ran khash.h > through a preprocessor with > > KHASH_MAP_INIT_STR(str_to_entry, entry_t) > > and the result is 180 lines of code for the hash table. Then pythread.h > provides the thread lock, another 50 lines for the interning logic > (intern_literal, i
Re: [Cython] Hash-based vtables
Robert Bradshaw wrote: >On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn > wrote: >> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: >>> >>> On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn There's still the indirection through SEP 200 (extensibletype >slots). We can get rid of that very easily by just making that table and the >hash-vtable one and the same. (It could still either have interned string keys >or ID keys depending on the least significant bit.) >>> >>> >>> Or we can even forgo the interning for this table, and give up on >>> partitioning the space numerically and just use the dns-style >>> prefixing, e.g. "org.cython.X" belongs to us. >> >> >> Huh? Isn't that when you *need* interning? Do you plan on >key-encoding those >> kind of strings into 64 bits? > >No, use 64-bits of a a cryptographically-secure hash. > >> (I think it would usually be "method:foo:ii->d" (or my current >preference is >> "method:foo:i4i8->f8")) > >Yeah, I was assuming methods wouldn't be specific to Cython. (Putting >sizes in the format makes a lot of sense for persistent storage, but I >think it's safe to assume that a"long in the provider == a long in the >consumer, and this would mean the hashes would have to be computed >after (some) C compilation). > >> Partitioning the space numerically you'd just hash the number; "SEP >260: We >> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". > >But why bother with the id? > >>> There is value in the double indirection if this (or any of the >other) >>> lookup tables are meant to be modified over time. >> >> >> This isn't impossible with a hash table either. You just need to >reallocate >> a little more often than what would be the case with a regular hash >table, >> but not dramatically so (you need to rehash whenever the element to >insert >> hashes into a "large" bin, which are rather few). >> >> I want the table to have a pointer to it, so that you can atomically >swap it >> out. > >I think that's worth a level of indirection. > To wrap up, I think this has grown in complexity beyond the "simple >SEP spec". It's at the point where you don't really want to have >several libraries implementing the same simple spec, but instead use the >same implementation. But I think the advantages are simply too good to give up on. So I think a viable route forward is to forget the CEP/SEP/pre-PEP-approach for now (which only works for semi-complicated ideas with simple implementations) and instead simply work more directly on a >library. It would need to have a couple of different use modes: >>> >>> >>> I prefer an enhancement proposal with a spec over a library, even if >>> only a single library gets used in practice. I still think it's >simple >>> enough. Basically, we have the "lookup spec" and then a CEP for >>> applying this to fast callable (agreeing on signatures, and what to >do >>> with extern types) and extensible type slots. >> >> >> OK. >> >> >>> - A Python perfect-hasher for use when generating code, with only >the a string interner based on CPython dicts and extensibletype metaclass >as runtime dependencies (for use in Cython). This would only add some hundred source file lines... - A C implementation of the perfect hashing exposed through a PyPerfectHashTable_Ready(), for use in libraries written in C like NumPy/SciPy). This would need to bundle the md5 algorithm and a C implementation of the perfect hashing. And on the distribution axis: - Small C header-style implementation of a string interner and the extensibletype metaclass, rendezvousing through sys.modules - As part of the rendezvous, one would always try to __import__ >the *real* run-time library. So if it is available in sys.path it overrides >anything bundled with other libraries. That would provide a way forward for GIL-less string interning, a Python-side library for working with these >tables and inspecting them, etc. >>> >>> >>> Hmm, that's an interesting idea. I think we don't actually need >>> interning, in which case the "full" library is only needed for >>> introspection. >> >> >> You don't believe the security concern is real then? Or do you want >to pay >> the cost for a 160-bit SHA1 compare everywhere? >> >> I'd love to not do interning, but I see no way around it. > >No, I want to use the lower 64 bits by default, but always have the >top 96 bits around to allow using this mechanism in "secure" mode at a >slight penalty. md5 is out because there are known collisions. (Yes, >sha-1 may succumb sooner rather than later, theoretical weaknesses >have been shown, so we could look to using something else (hopefully >still shipped with Python). But very few users are going to know about this. What's the odds that the user who decide to trigger JIT-compilation with function signatures that
Re: [Cython] Hash-based vtables
On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn wrote: > > > Robert Bradshaw wrote: > >>On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >> wrote: >>> On 06/09/2012 03:21 AM, Robert Bradshaw wrote: On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn > There's still the indirection through SEP 200 (extensibletype >>slots). We > can > get rid of that very easily by just making that table and the >>hash-vtable > one and the same. (It could still either have interned string keys >>or ID > keys depending on the least significant bit.) Or we can even forgo the interning for this table, and give up on partitioning the space numerically and just use the dns-style prefixing, e.g. "org.cython.X" belongs to us. >>> >>> >>> Huh? Isn't that when you *need* interning? Do you plan on >>key-encoding those >>> kind of strings into 64 bits? >> >>No, use 64-bits of a a cryptographically-secure hash. >> >>> (I think it would usually be "method:foo:ii->d" (or my current >>preference is >>> "method:foo:i4i8->f8")) >> >>Yeah, I was assuming methods wouldn't be specific to Cython. (Putting >>sizes in the format makes a lot of sense for persistent storage, but I >>think it's safe to assume that a"long in the provider == a long in the >>consumer, and this would mean the hashes would have to be computed >>after (some) C compilation). >> >>> Partitioning the space numerically you'd just hash the number; "SEP >>260: We >>> use id 0x70040001, which has lower-64-md5 0xfa454a...ULL". >> >>But why bother with the id? >> There is value in the double indirection if this (or any of the >>other) lookup tables are meant to be modified over time. >>> >>> >>> This isn't impossible with a hash table either. You just need to >>reallocate >>> a little more often than what would be the case with a regular hash >>table, >>> but not dramatically so (you need to rehash whenever the element to >>insert >>> hashes into a "large" bin, which are rather few). >>> >>> I want the table to have a pointer to it, so that you can atomically >>swap it >>> out. >> >>I think that's worth a level of indirection. >> > To wrap up, I think this has grown in complexity beyond the "simple >>SEP > spec". It's at the point where you don't really want to have >>several > libraries implementing the same simple spec, but instead use the >>same > implementation. > > But I think the advantages are simply too good to give up on. > > So I think a viable route forward is to forget the > CEP/SEP/pre-PEP-approach > for now (which only works for semi-complicated ideas with simple > implementations) and instead simply work more directly on a >>library. It > would need to have a couple of different use modes: I prefer an enhancement proposal with a spec over a library, even if only a single library gets used in practice. I still think it's >>simple enough. Basically, we have the "lookup spec" and then a CEP for applying this to fast callable (agreeing on signatures, and what to >>do with extern types) and extensible type slots. >>> >>> >>> OK. >>> >>> > - A Python perfect-hasher for use when generating code, with only >>the a > string interner based on CPython dicts and extensibletype metaclass >>as > runtime dependencies (for use in Cython). This would only add some > hundred > source file lines... > > - A C implementation of the perfect hashing exposed through a > PyPerfectHashTable_Ready(), for use in libraries written in C like > NumPy/SciPy). This would need to bundle the md5 algorithm and a C > implementation of the perfect hashing. > > And on the distribution axis: > > - Small C header-style implementation of a string interner and the > extensibletype metaclass, rendezvousing through sys.modules > > - As part of the rendezvous, one would always try to __import__ >>the > *real* > run-time library. So if it is available in sys.path it overrides >>anything > bundled with other libraries. That would provide a way forward for > GIL-less > string interning, a Python-side library for working with these >>tables and > inspecting them, etc. Hmm, that's an interesting idea. I think we don't actually need interning, in which case the "full" library is only needed for introspection. >>> >>> >>> You don't believe the security concern is real then? Or do you want >>to pay >>> the cost for a 160-bit SHA1 compare everywhere? >>> >>> I'd love to not do interning, but I see no way around it. >> >>No, I want to use the lower 64 bits by default, but always have the >>top 96 bits around to allow using this mechanism in "secure" mode at a >>slight penalty. md5 is out because there are known collisions. (Yes, >>sha-1 may succumb sooner rather than later, theoretical weaknesses >>have been shown, so we could look to usi
Re: [Cython] Hash-based vtables
On 06/10/2012 09:34 AM, Robert Bradshaw wrote: On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn wrote: Robert Bradshaw wrote: On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode at a slight penalty. md5 is out because there are known collisions. (Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else (hopefully still shipped with Python). But very few users are going to know about this. What's the odds that the user who decide to trigger JIT-compilation with function signatures that varies based on the input will know about the option and turn it on and also recompile all his/her C extension modules? In practice, such an option would always stay at its default value. If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay. Yes, it's not ideal from this perspective. And if you *do* want to run in secure mode, it will be a lot slower than interning. Are you thinking that the 64-bit interned pointer would be used as the hash? In this case all hashtables would have to be constructed at runtime, which means it needs to be really, really cheap (well under a milisecond, I'm sure Sage has>1000 classes,>1 methods it imports at startup). Also I'm not sure how the very-uneven distribution would play out for constructing perfect hastables (perhaps it won't hurt, there's likely to be long runs of consecutive values in some cases. No, I'm thinking that callsites need both the 64-bit interned char* and the 64-bit hash of the *contents*. They use the hash to figure out the position, then compare by ID. The hash is not stored in callees, it's discarded after figuring out the table layout. (There was this idea that if the char* has least significant bit set, we'd hash it directly rather than dereference it, but let's ignore that for now.) I don't think under a millisecond is unfeasible to hash smallish tables -- we could put the pointer through a cheap hash to create more entropy (for the perfect hashing, being able to select a hash function through the >>r is important, so you can't just use the pointer directly -- but there are functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) That would save us a register and make the instructions shorter in some places I guess...I think it's really miniscule, it's not like the effect of load of a global variable. But if you like this approach I can benchmark C-written hashtable creation and see. Dag ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] Hash-based vtables
On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn wrote: > On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >> >> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >> wrote: >>> >>> >>> >>> Robert Bradshaw wrote: >>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: > > I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode at a slight penalty. md5 is out because there are known collisions. (Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else (hopefully still shipped with Python). >>> >>> >>> But very few users are going to know about this. What's the odds that the >>> user who decide to trigger JIT-compilation with function signatures that >>> varies based on the input will know about the option and turn it on and also >>> recompile all his/her C extension modules? >>> >>> In practice, such an option would always stay at its default value. If we >>> leave it to secure by default and start teaching it to users from the >>> start...but that's a big price to pay. >> >> >> Yes, it's not ideal from this perspective. >> >>> And if you *do* want to run in secure mode, it will be a lot slower than >>> interning. >> >> >> Are you thinking that the 64-bit interned pointer would be used as the >> hash? In this case all hashtables would have to be constructed at >> runtime, which means it needs to be really, really cheap (well under a >> milisecond, I'm sure Sage has>1000 classes,>1 methods it imports >> at startup). Also I'm not sure how the very-uneven distribution would >> play out for constructing perfect hastables (perhaps it won't hurt, >> there's likely to be long runs of consecutive values in some cases. > > > No, I'm thinking that callsites need both the 64-bit interned char* and the > 64-bit hash of the *contents*. They use the hash to figure out the position, > then compare by ID. Ah, I missed that bit. OK, yes, that could work well. > The hash is not stored in callees, it's discarded after figuring out the > table layout. > > (There was this idea that if the char* has least significant bit set, we'd > hash it directly rather than dereference it, but let's ignore that for now.) (For the purpose of this discussion, it's part of the "interning" step.) > I don't think under a millisecond is unfeasible to hash smallish tables -- > we could put the pointer through a cheap hash to create more entropy (for > the perfect hashing, being able to select a hash function through the >>r is > important, so you can't just use the pointer directly -- but there are > functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) Just a sec, we're not hashing pointers, but the full signature itself, right? For our hash function we need (1) Collision free on 64-bits (for non-malicious use). (2) Good distribution (including for short strings, which is harder to come by). (2b) Any small subset of bits should have property (2). (3) Ideally easy to reference (e.g. "md5" is better than "these 100 lines of C code"). Cheap runtime construction is still ideal, but much less of an issue if hashes (and perfect tables) can be constructed at compile time, which I think this scheme allows. > That would save us a register and make the instructions shorter in some > places I guess...I think it's really miniscule, it's not like the effect of > load of a global variable. But if you like this approach I can benchmark > C-written hashtable creation and see. This will have value in and of itself (both the implementation and the benchmarks). - Robert ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] Hash-based vtables
Dag Sverre Seljebotn wrote: >On 06/10/2012 09:34 AM, Robert Bradshaw wrote: >> On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn >> wrote: >>> >>> >>> Robert Bradshaw wrote: >>> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: > I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode >at a slight penalty. md5 is out because there are known collisions. >(Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else >(hopefully still shipped with Python). >>> >>> But very few users are going to know about this. What's the odds >that the user who decide to trigger JIT-compilation with function >signatures that varies based on the input will know about the option >and turn it on and also recompile all his/her C extension modules? >>> >>> In practice, such an option would always stay at its default value. >If we leave it to secure by default and start teaching it to users from >the start...but that's a big price to pay. >> >> Yes, it's not ideal from this perspective. >> >>> And if you *do* want to run in secure mode, it will be a lot slower >than interning. >> >> Are you thinking that the 64-bit interned pointer would be used as >the >> hash? In this case all hashtables would have to be constructed at >> runtime, which means it needs to be really, really cheap (well under >a >> milisecond, I'm sure Sage has>1000 classes,>1 methods it imports >> at startup). Also I'm not sure how the very-uneven distribution would >> play out for constructing perfect hastables (perhaps it won't hurt, >> there's likely to be long runs of consecutive values in some cases. > >No, I'm thinking that callsites need both the 64-bit interned char* and > >the 64-bit hash of the *contents*. They use the hash to figure out the >position, then compare by ID. > >The hash is not stored in callees, it's discarded after figuring out >the >table layout. > >(There was this idea that if the char* has least significant bit set, >we'd hash it directly rather than dereference it, but let's ignore that > >for now.) > >I don't think under a millisecond is unfeasible to hash smallish tables > >-- we could put the pointer through a cheap hash to create more entropy > >(for the perfect hashing, being able to select a hash function through >the >>r is important, so you can't just use the pointer directly -- but > >there are functions cheaper than md5, e.g, in here: >http://code.google.com/p/ulib/) > >That would save us a register and make the instructions shorter in some > >places I guess...I think it's really miniscule, it's not like the >effect >of load of a global variable. But if you like this approach I can >benchmark C-written hashtable creation and see. I don't know what I was thinking. Tha callsite can't hash every time, and the pointer doesn't contain enough entropy for perfect hashing, so hashing the pointer has only disadvantages. I really think the call site should have both a hash and a separate interned ID. And if the caller knows the entry should be there, it can skip the ID check and only needs the hash. That makes the table pretty slick for non-smart callers too, it would be (id, flags, ptr)-entries, and callers could either do strcmp or interning, with or without hashing. (I realize the information would be there in your proposal too, but this would be slimmer). Dag > >Dag >___ >cython-devel mailing list >cython-devel@python.org >http://mail.python.org/mailman/listinfo/cython-devel -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] Hash-based vtables
On 06/10/2012 10:23 AM, Robert Bradshaw wrote: On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn wrote: On 06/10/2012 09:34 AM, Robert Bradshaw wrote: On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn wrote: Robert Bradshawwrote: On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn wrote: I'd love to not do interning, but I see no way around it. No, I want to use the lower 64 bits by default, but always have the top 96 bits around to allow using this mechanism in "secure" mode at a slight penalty. md5 is out because there are known collisions. (Yes, sha-1 may succumb sooner rather than later, theoretical weaknesses have been shown, so we could look to using something else (hopefully still shipped with Python). But very few users are going to know about this. What's the odds that the user who decide to trigger JIT-compilation with function signatures that varies based on the input will know about the option and turn it on and also recompile all his/her C extension modules? In practice, such an option would always stay at its default value. If we leave it to secure by default and start teaching it to users from the start...but that's a big price to pay. Yes, it's not ideal from this perspective. And if you *do* want to run in secure mode, it will be a lot slower than interning. Are you thinking that the 64-bit interned pointer would be used as the hash? In this case all hashtables would have to be constructed at runtime, which means it needs to be really, really cheap (well under a milisecond, I'm sure Sage has>1000 classes,>1 methods it imports at startup). Also I'm not sure how the very-uneven distribution would play out for constructing perfect hastables (perhaps it won't hurt, there's likely to be long runs of consecutive values in some cases. No, I'm thinking that callsites need both the 64-bit interned char* and the 64-bit hash of the *contents*. They use the hash to figure out the position, then compare by ID. Ah, I missed that bit. OK, yes, that could work well. Ah, we've been talking past one another for some time then. OK, let's settle on that. The hash is not stored in callees, it's discarded after figuring out the table layout. (There was this idea that if the char* has least significant bit set, we'd hash it directly rather than dereference it, but let's ignore that for now.) (For the purpose of this discussion, it's part of the "interning" step.) I don't think under a millisecond is unfeasible to hash smallish tables -- we could put the pointer through a cheap hash to create more entropy (for the perfect hashing, being able to select a hash function through the>>r is important, so you can't just use the pointer directly -- but there are functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) Just a sec, we're not hashing pointers, but the full signature itself, right? For our hash function we need (1) Collision free on 64-bits (for non-malicious use). (2) Good distribution (including for short strings, which is harder to come by). (2b) Any small subset of bits should have property (2). (3) Ideally easy to reference (e.g. "md5" is better than "these 100 lines of C code"). Cheap runtime construction is still ideal, but much less of an issue if hashes (and perfect tables) can be constructed at compile time, which I think this scheme allows. Yes, 64 bits of md5 then? ulib contains "100 lines of C code" for md5 anyway, if one doesn't want to go through Python hashlib (I imagine e.g. hashlib might be unavailable somewhere as it relies on openssl and there's license war going on vs. gnutls and so on. And the md5 module is deprecated.). That would save us a register and make the instructions shorter in some places I guess...I think it's really miniscule, it's not like the effect of load of a global variable. But if you like this approach I can benchmark C-written hashtable creation and see. This will have value in and of itself (both the implementation and the benchmarks). Will do (eventually, less spare time in coming week). About signatures, a problem I see with following the C typing is that the signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq" on 32-bit Linux, and so on. I think that would be really bad. "l" must be banished -- but then one might as well do "i4i8i8". Designing a signature hash where you select between these at compile-time is perhaps doable but does generate a lot of code and makes everything complicated. I think we should just start off with hashing at module load time when sizes are known, and then work with heuristics and/or build system integration to improve on that afterwards. Dag ___ cython-devel mailing list cython-devel@python.org http://mail.python.org/mailman/listinfo/cython-devel
Re: [Cython] Hash-based vtables
On Sun, Jun 10, 2012 at 1:43 AM, Dag Sverre Seljebotn wrote: > On 06/10/2012 10:23 AM, Robert Bradshaw wrote: >> >> On Sun, Jun 10, 2012 at 1:00 AM, Dag Sverre Seljebotn >> wrote: >>> >>> On 06/10/2012 09:34 AM, Robert Bradshaw wrote: On Sun, Jun 10, 2012 at 12:14 AM, Dag Sverre Seljebotn wrote: > > > > > Robert Bradshaw wrote: > >> On Fri, Jun 8, 2012 at 10:45 PM, Dag Sverre Seljebotn >> wrote: >>> >>> >>> I'd love to not do interning, but I see no way around it. >> >> >> >> No, I want to use the lower 64 bits by default, but always have the >> top 96 bits around to allow using this mechanism in "secure" mode at a >> slight penalty. md5 is out because there are known collisions. (Yes, >> sha-1 may succumb sooner rather than later, theoretical weaknesses >> have been shown, so we could look to using something else (hopefully >> still shipped with Python). > > > > But very few users are going to know about this. What's the odds that > the > user who decide to trigger JIT-compilation with function signatures > that > varies based on the input will know about the option and turn it on and > also > recompile all his/her C extension modules? > > In practice, such an option would always stay at its default value. If > we > leave it to secure by default and start teaching it to users from the > start...but that's a big price to pay. Yes, it's not ideal from this perspective. > And if you *do* want to run in secure mode, it will be a lot slower > than > interning. Are you thinking that the 64-bit interned pointer would be used as the hash? In this case all hashtables would have to be constructed at runtime, which means it needs to be really, really cheap (well under a milisecond, I'm sure Sage has>1000 classes,>1 methods it imports at startup). Also I'm not sure how the very-uneven distribution would play out for constructing perfect hastables (perhaps it won't hurt, there's likely to be long runs of consecutive values in some cases. >>> >>> >>> >>> No, I'm thinking that callsites need both the 64-bit interned char* and >>> the >>> 64-bit hash of the *contents*. They use the hash to figure out the >>> position, >>> then compare by ID. >> >> >> Ah, I missed that bit. OK, yes, that could work well. > > > Ah, we've been talking past one another for some time then. OK, let's settle > on that. > > >> >>> The hash is not stored in callees, it's discarded after figuring out the >>> table layout. >>> >>> (There was this idea that if the char* has least significant bit set, >>> we'd >>> hash it directly rather than dereference it, but let's ignore that for >>> now.) >> >> >> (For the purpose of this discussion, it's part of the "interning" step.) >> >>> I don't think under a millisecond is unfeasible to hash smallish tables >>> -- >>> we could put the pointer through a cheap hash to create more entropy (for >>> the perfect hashing, being able to select a hash function through the>>r >>> is >>> important, so you can't just use the pointer directly -- but there are >>> functions cheaper than md5, e.g, in here: http://code.google.com/p/ulib/) >> >> >> Just a sec, we're not hashing pointers, but the full signature itself, >> right? For our hash function we need >> >> (1) Collision free on 64-bits (for non-malicious use). >> (2) Good distribution (including for short strings, which is harder to >> come by). >> (2b) Any small subset of bits should have property (2). >> (3) Ideally easy to reference (e.g. "md5" is better than "these 100 >> lines of C code"). >> >> Cheap runtime construction is still ideal, but much less of an issue >> if hashes (and perfect tables) can be constructed at compile time, >> which I think this scheme allows. > > > Yes, 64 bits of md5 then? +1 for me. > ulib contains "100 lines of C code" for md5 > anyway, if one doesn't want to go through Python hashlib (I imagine e.g. > hashlib might be unavailable somewhere as it relies on openssl and there's > license war going on vs. gnutls and so on. And the md5 module is > deprecated.). Just the interface, right? (hashlib should be used instead...) >>> That would save us a register and make the instructions shorter in some >>> places I guess...I think it's really miniscule, it's not like the effect >>> of >>> load of a global variable. But if you like this approach I can benchmark >>> C-written hashtable creation and see. >> >> >> This will have value in and of itself (both the implementation and the >> benchmarks). > > > Will do (eventually, less spare time in coming week). > > About signatures, a problem I see with following the C typing is that the > signature "ill" wouldn't hash the same as "iii" on 32-bit Windows and "iqq" > on 32-bit Linux, and so on. I think that would be really bad. This i