Re: [Cython] Bug: bad C code generated for (some) "... and ... or ..." expressions
Stefan Behnel wrote at 2012-6-8 08:50 +0200:
> thanks for the report.
>
> Dieter Maurer, 07.06.2012 10:44:
>> "cython 0.13" generates bad C code for the attached "pyx" file.
>
> Could you try the latest release? I would at least expect an error instead
> of actually generating code.

The latest release on PyPI is "0.16". It behaves identically to version "0.13": no error message, just wrongly generated C code (C code containing bare ";" "statements").

-- Dieter
Re: [Cython] Hash-based vtables
On 06/07/2012 12:35 PM, Dag Sverre Seljebotn wrote:
On 06/07/2012 12:20 PM, Dag Sverre Seljebotn wrote:
On 06/07/2012 12:26 AM, Robert Bradshaw wrote:
On Wed, Jun 6, 2012 at 2:36 PM, Dag Sverre Seljebotn wrote:
On 06/06/2012 11:16 PM, Robert Bradshaw wrote:
On Wed, Jun 6, 2012 at 1:57 PM, Dag Sverre Seljebotn wrote:
On 06/06/2012 10:41 PM, Dag Sverre Seljebotn wrote:
On 06/05/2012 12:30 AM, Robert Bradshaw wrote:

[Robert Bradshaw]
I just found http://cmph.sourceforge.net/ which looks quite interesting. Though the resulting hash functions are supposedly cheap, I have the feeling that branching is considered cheap in this context.

[Dag Sverre Seljebotn]
Actually, this lead was *very* promising. I believe the very first reference I actually read through and didn't eliminate after the abstract totally swept away our home-grown solutions!

"Hash & Displace" by Pagh (1999) is actually very simple, easy to understand, and fast both for generation and (the branch-free) lookup:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.3753&rep=rep1&type=pdf

The idea is:

 - Find a hash `g(x)` to partition the keys into `b` groups (the paper requires b > 2n, though I think in practice you can often get away with less)
 - Find a hash `f(x)` such that f is 1:1 within each group (which is easily achieved since groups only have a few elements)
 - For each group, from largest to smallest: find a displacement `d[group]` so that `f(x) ^ d` doesn't cause collisions.

It requires extra storage for the displacement table. However, I think 8 bits per element might suffice even for vtables of 512 or 1024 in size. Even with 16 bits it's rather negligible compared to the minimum-128-bit entries of the table.

I benchmarked these hash functions:

    displace1: ((h >> r1) ^ d[h & 63]) & m1
    displace2: ((h >> r1) ^ d[h & m2]) & m1
    displace3: ((h >> r1) ^ d[(h >> r2) & m2]) & m1

Only the third one is truly in the spirit of the algorithm, but I think the first two should work well too (and when h is known at compile time, looking up d[h & 63] isn't harder than looking up r1 or m1).

My computer is acting up and all my numbers today are slower than the earlier ones (yes, I disabled turbo-mode in the BIOS a year ago, and yes, I've pinned the CPU speed). But here are today's numbers, compiled with -DIMHASH:

    direct:     min=5.37e-09 mean=5.39e-09 std=1.96e-11 val=24.00
    index:      min=6.45e-09 mean=6.46e-09 std=1.15e-11 val=18.00
    twoshift:   min=6.99e-09 mean=7.00e-09 std=1.35e-11 val=18.00
    threeshift: min=7.53e-09 mean=7.54e-09 std=1.63e-11 val=18.00
    displace1:  min=6.99e-09 mean=7.00e-09 std=1.66e-11 val=18.00
    displace2:  min=6.99e-09 mean=7.02e-09 std=2.77e-11 val=18.00
    displace3:  min=7.52e-09 mean=7.54e-09 std=1.19e-11 val=18.00

I did a dirty prototype of the table-finder as well and it works:

https://github.com/dagss/hashvtable/blob/master/pagh99.py

[Dag Sverre Seljebotn]
The paper obviously puts more effort into minimizing table size than into a fast lookup. My hunch is that our choice should be

    ((h >> table.r) ^ table.d[h & m2]) & m1

and use an 8-bit d (because even if you have 1024 methods, you'd rather double the number of bins than have those 2 extra bits available for displacement options).

Then keep incrementing the size of d and the number of table slots (in such an order that the total vtable size is minimized) until success. In practice this should almost always just increase the size of d, and keep the table size at the lowest 2**k that fits the slots (even for 64 or 128 methods :-))

Essentially we avoid the shift in the argument to d[] by making d larger.

[Robert Bradshaw]
Nice. I'm surprised that the indirection on d doesn't cost us much;

[Dag Sverre Seljebotn]
Well, table->d[const & const] compiles down to the same kind of code as table->m1. I guess I'm surprised too that displace2 doesn't penalize.

[Robert Bradshaw]
hopefully its size wouldn't be a big issue either. What kinds of densities were you achieving?

[Dag Sverre Seljebotn]
OK, simulation results just in (for the displace2 hash), and they exceeded my expectations. I always fill the table with n=2^k keys, and fix b = n (b means |d|). Then the failure rates are (top two are 100,000 simulations, the rest are 1000 simulations):

    n=   8 b=   8 failure-rate=0.0019 try-mean=4.40  try-max=65
    n=  16 b=  16 failure-rate=0.0008 try-mean=5.02  try-max=65
    n=  32 b=  32 failure-rate=0.     try-mean=5.67  try-max=25
    n=  64 b=  64 failure-rate=0.     try-mean=6.60  try-max=29
    n= 128 b= 128 failure-rate=0.     try-mean=7.64  try-max=22
    n= 256 b= 256 failure-rate=0.     try-mean=8.66  try-max=37
    n= 512 b= 512 failure-rate=0.     try-mean=9.57  try-max=26
    n=1024 b=1024 failure-rate=0.     try-mean=10.66 try-max=34

Try-mean and try-max are how many r's needed to be tried before success, so they give an indication of how much margin is left before failure. For the ~1/1000 chance of failure for n=8 and n=16, we would proceed to let b=2*n (100,000 simulations):

    n=   8 b=  16 failure-rate=0.0001 try-mean=2.43 try-max=65
    n=  16 b=  32 failure-rate=0.
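For concreteness, here is a minimal C sketch of the branch-free displace2-style lookup benchmarked above. The struct layout and the names (hashvtable_t, hashvtable_lookup) are illustrative assumptions, not the actual hashvtable prototype or a final SEP layout:

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t id;        /* interned key / 64-bit prehash */
    void    *funcptr;   /* the vtable entry itself */
} hashvtable_entry_t;

typedef struct {
    uint8_t  r;          /* shift amount */
    uint64_t m1;         /* nslots - 1 (nslots a power of two) */
    uint64_t m2;         /* |d| - 1   (|d| a power of two)     */
    uint8_t *d;          /* displacement table, 8 bits per bin */
    hashvtable_entry_t *entries;
} hashvtable_t;

/* Branch-free slot computation: ((h >> r) ^ d[h & m2]) & m1,
 * where h is the precomputed 64-bit hash of the key. */
static size_t
hashvtable_slot(const hashvtable_t *table, uint64_t h)
{
    return (size_t)(((h >> table->r) ^ table->d[h & table->m2]) & table->m1);
}

/* Lookup with an id check; whether the check is needed at all depends on
 * how keys are interned/agreed upon between caller and callee. */
static void *
hashvtable_lookup(const hashvtable_t *table, uint64_t h)
{
    hashvtable_entry_t *e = &table->entries[hashvtable_slot(table, h)];
    return (e->id == h) ? e->funcptr : NULL;
}

When the key hash h is a compile-time constant, the slot computation reduces to a load of d[h & m2] plus a shift, xor and mask, which is what makes the indirection through d essentially free, as observed in the benchmarks above.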
Re: [Cython] Hash-based vtables
On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn wrote:
> [quoted text snipped; it repeats the preceding post verbatim]
Re: [Cython] Hash-based vtables
On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
> [quoted text snipped; it repeats earlier posts in this thread verbatim]
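The table-construction search discussed earlier in the thread (bucket the keys by h & m2, assign 8-bit displacements starting with the largest bucket, retry with a new r on failure, and only then grow d or the slot count) can be sketched roughly as follows. This is an unoptimized illustration with assumed names and no error handling, not the pagh99.py prototype:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Try to find r and an 8-bit displacement table d for the given 64-bit key
 * hashes, using slot = ((h >> r) ^ d[h & (nbins-1)]) & (nslots-1).
 * nslots and nbins must be powers of two; d must hold nbins bytes.
 * Returns 1 and fills d and *r_out on success; returns 0 if no r in [0, 64)
 * works, in which case the caller grows nbins (or nslots) and retries. */
static int
hashvtable_build(const uint64_t *keys, size_t nkeys,
                 size_t nslots, size_t nbins,
                 uint8_t *d, uint8_t *r_out)
{
    uint64_t m1 = (uint64_t)nslots - 1, m2 = (uint64_t)nbins - 1;
    size_t *bin_count = calloc(nbins, sizeof(size_t));
    size_t *order = malloc(nbins * sizeof(size_t));
    size_t *slots = malloc(nkeys * sizeof(size_t));
    char *taken = malloc(nslots);
    size_t i, g;
    uint8_t r;
    int ok = 0;

    for (i = 0; i < nkeys; i++)
        bin_count[keys[i] & m2]++;

    /* Process bins from largest to smallest (selection sort; fine for a sketch). */
    for (g = 0; g < nbins; g++)
        order[g] = g;
    for (g = 0; g + 1 < nbins; g++) {
        size_t j, best = g, tmp;
        for (j = g + 1; j < nbins; j++)
            if (bin_count[order[j]] > bin_count[order[best]])
                best = j;
        tmp = order[g]; order[g] = order[best]; order[best] = tmp;
    }

    for (r = 0; r < 64 && !ok; r++) {
        memset(taken, 0, nslots);
        memset(d, 0, nbins);
        ok = 1;
        for (g = 0; g < nbins && ok; g++) {
            size_t bin = order[g];
            int disp, found = 0;
            if (bin_count[bin] == 0)
                continue;
            for (disp = 0; disp < 256 && !found; disp++) {
                /* Every key in this bin must land in a free, distinct slot. */
                size_t nb = 0, k;
                int clash = 0;
                for (i = 0; i < nkeys && !clash; i++) {
                    size_t s;
                    if ((keys[i] & m2) != bin)
                        continue;
                    s = (size_t)(((keys[i] >> r) ^ (uint64_t)disp) & m1);
                    if (taken[s])
                        clash = 1;
                    for (k = 0; k < nb && !clash; k++)
                        if (slots[k] == s)
                            clash = 1;
                    if (!clash)
                        slots[nb++] = s;
                }
                if (!clash) {
                    for (k = 0; k < nb; k++)
                        taken[slots[k]] = 1;
                    d[bin] = (uint8_t)disp;
                    found = 1;
                }
            }
            if (!found)
                ok = 0;  /* this r failed; try the next one */
        }
        if (ok)
            *r_out = r;
    }
    free(bin_count); free(order); free(slots); free(taken);
    return ok;
}

A caller would first try nbins = nslots = the smallest power of two holding the keys, and on failure grow whichever of the two keeps the total vtable size smallest, as suggested earlier in the thread.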
Re: [Cython] Hash-based vtables
On 06/09/2012 07:45 AM, Dag Sverre Seljebotn wrote:
On 06/09/2012 03:21 AM, Robert Bradshaw wrote:
On Fri, Jun 8, 2012 at 2:12 PM, Dag Sverre Seljebotn wrote:

There's still the indirection through SEP 200 (extensibletype slots). We can get rid of that very easily by just making that table and the hash-vtable one and the same. (It could still either have interned string keys or ID keys depending on the least significant bit.)

Or we can even forgo the interning for this table, and give up on partitioning the space numerically and just use the dns-style prefixing, e.g. "org.cython.X" belongs to us.

Huh? Isn't that when you *need* interning? Do you plan on key-encoding those kinds of strings into 64 bits? (I think it would usually be "method:foo:ii->d" (or my current preference is "method:foo:i4i8->f8"))

Well, I guess something like "org.cython.X" would happen often as well, in addition. Just put it all in the same table :-)

Partitioning the space numerically you'd just hash the number; "SEP 260: We use id 0x70040001, which has lower-64-md5 0xfa454a...ULL".

The real use-case I see for this now is in having the PyArray_DATA etc. access pointers simply through compile-time constants the library can define on both ends. It could just do

    PyCustomSlots_Lookup(obj->ob_type, 0x70040001, 0xfa45323...ULL)

specifically to get a function retrieving the data-pointer. PyArray_SHAPE would do

    PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL)

Also, I'd want PyExtensibleType_Object to have:

    {
        ...
        PyPerfectTable *tp_perfect_table;
        Py_ssize_t tp_perfect_table_obj_offset;
    }

i.e. we allow for getting quickly to a table on the object in addition to the one on the type. Callbacks look up the one on the object first (before potentially checking for __call__ in the type); method-calling might ignore the one on the object.

Dag

There is value in the double indirection if this (or any of the other) lookup tables are meant to be modified over time.

This isn't impossible with a hash table either. You just need to reallocate a little more often than would be the case with a regular hash table, but not dramatically so (you need to rehash whenever the element to insert hashes into a "large" bin, which are rather few). I want the table to have a pointer to it, so that you can atomically swap it out.

To wrap up, I think this has grown in complexity beyond the "simple SEP spec". It's at the point where you don't really want to have several libraries implementing the same simple spec, but instead use the same implementation. But I think the advantages are simply too good to give up on. So I think a viable route forward is to forget the CEP/SEP/pre-PEP approach for now (which only works for semi-complicated ideas with simple implementations) and instead simply work more directly on a library. It would need to have a couple of different use modes:

I prefer an enhancement proposal with a spec over a library, even if only a single library gets used in practice. I still think it's simple enough. Basically, we have the "lookup spec" and then a CEP for applying this to fast callable (agreeing on signatures, and what to do with extern types) and extensible type slots.

OK.

 - A Python perfect-hasher for use when generating code, with only a string interner based on CPython dicts and the extensibletype metaclass as runtime dependencies (for use in Cython). This would only add a few hundred source lines...
 - A C implementation of the perfect hashing exposed through a PyPerfectHashTable_Ready(), for use in libraries written in C, like NumPy/SciPy. This would need to bundle the md5 algorithm and a C implementation of the perfect hashing.

And on the distribution axis:

 - Small C header-style implementation of a string interner and the extensibletype metaclass, rendezvousing through sys.modules
 - As part of the rendezvous, one would always try to __import__ the *real* run-time library. So if it is available in sys.path it overrides anything bundled with other libraries. That would provide a way forward for GIL-less string interning, a Python-side library for working with these tables and inspecting them, etc.

Hmm, that's an interesting idea. I think we don't actually need interning, in which case the "full" library is only needed for introspection.

You don't believe the security concern is real then? Or do you want to pay the cost for a 160-bit SHA1 compare everywhere? I'd love to not do interning, but I see no way around it.

BTW, a GIL-less interning library isn't rocket science. I ran khash.h through a preprocessor with

    KHASH_MAP_INIT_STR(str_to_entry, entry_t)

and the result is 180 lines of code for the hash table. Then pythread.h provides the thread lock, another 50 lines for the interning logic (intern_literal, intern_heap_allocated, release_interned).

It just seems a little redundant to ship such a thing in every Cython-generated file since we hold the GIL during
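A GIL-less interner along the lines described (a khash.h string map plus a pythread.h lock) might look roughly like the sketch below. entry_t, the prehash field, and the exact intern_literal/release_interned semantics are assumptions here; only the string-literal case is shown and error handling is omitted:

#include <Python.h>
#include <pythread.h>
#include <stdint.h>
#include "khash.h"

typedef struct {
    uint64_t prehash;   /* e.g. lower 64 bits of the md5 of the key */
    size_t   refcount;
} entry_t;

KHASH_MAP_INIT_STR(str_to_entry, entry_t)

static khash_t(str_to_entry) *intern_table = NULL;
static PyThread_type_lock intern_lock = NULL;

static int
intern_init(void)
{
    intern_table = kh_init(str_to_entry);
    intern_lock = PyThread_allocate_lock();
    return (intern_table && intern_lock) ? 0 : -1;
}

/* Return the canonical (interned) pointer for `key`; the key must outlive
 * the table (a string literal), hence no copy is made here. */
static const char *
intern_literal(const char *key, uint64_t prehash)
{
    khiter_t it;
    int ret;
    PyThread_acquire_lock(intern_lock, WAIT_LOCK);
    it = kh_put(str_to_entry, intern_table, key, &ret);
    if (ret != 0) {                      /* key was not present: new entry */
        entry_t e = { prehash, 1 };
        kh_value(intern_table, it) = e;
    } else {
        kh_value(intern_table, it).refcount++;
    }
    key = kh_key(intern_table, it);      /* canonical pointer */
    PyThread_release_lock(intern_lock);
    return key;
}

static void
release_interned(const char *interned)
{
    khiter_t it;
    PyThread_acquire_lock(intern_lock, WAIT_LOCK);
    it = kh_get(str_to_entry, intern_table, interned);
    if (it != kh_end(intern_table) && --kh_value(intern_table, it).refcount == 0)
        kh_del(str_to_entry, intern_table, it);
    PyThread_release_lock(intern_lock);
}

intern_heap_allocated would presumably differ mainly in copying the key before insertion and freeing it on release.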
Re: [Cython] Hash-based vtables
On 06/09/2012 08:00 AM, Dag Sverre Seljebotn wrote:
> [...]
> The real use-case I see for this now is in having the PyArray_DATA etc.
> access pointers simply through compile-time constants the library can
> define on both ends. It could just do PyCustomSlots_Lookup(obj->ob_type,
> 0x70040001, 0xfa45323...ULL) specifically to get a function retrieving
> the data-pointer. PyArray_SHAPE would do
> PyCustomSlots_Lookup(obj->ob_type, 0x70040002, 0xbbad423...ULL)

Argh. I meant 0x70040002 | 1, of course ;-)

DS

> [remainder of the quoted text snipped; it repeats the previous post verbatim]
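To make the object-then-type lookup order concrete, here is a rough sketch layered on the proposal above. The thread's PyCustomSlots_Lookup takes the type plus an id and its lower-64 md5; this sketch folds the key into a single precomputed 64-bit hash instead, and the PyHeapTypeObject base and all names are assumptions, not an existing API:

#include <Python.h>
#include <stdint.h>
#include <stddef.h>

/* Same layout idea as the hashvtable sketch earlier in this digest. */
typedef struct { uint64_t id; void *funcptr; } PyPerfectTable_Entry;
typedef struct {
    uint8_t r;
    uint64_t m1, m2;
    uint8_t *d;
    PyPerfectTable_Entry *entries;
} PyPerfectTable;

static void *
PyPerfectTable_Find(PyPerfectTable *t, uint64_t h)
{
    PyPerfectTable_Entry *e =
        &t->entries[((h >> t->r) ^ t->d[h & t->m2]) & t->m1];
    return (e->id == h) ? e->funcptr : NULL;
}

/* The thread only names the two extra fields; what precedes them is elided
 * ("...") there, so embedding PyHeapTypeObject is an assumption. */
typedef struct {
    PyHeapTypeObject etp_heaptype;
    PyPerfectTable *tp_perfect_table;            /* table on the type */
    Py_ssize_t tp_perfect_table_obj_offset;      /* per-object table, or 0 */
} PyExtensibleType_Object;

static void *
PyCustomSlots_LookupSketch(PyObject *obj, uint64_t key_hash)
{
    PyExtensibleType_Object *tp = (PyExtensibleType_Object *)Py_TYPE(obj);
    void *result = NULL;

    /* "Callbacks look up the one on the object first ..." */
    if (tp->tp_perfect_table_obj_offset != 0) {
        PyPerfectTable *obj_table =
            *(PyPerfectTable **)((char *)obj + tp->tp_perfect_table_obj_offset);
        if (obj_table != NULL)
            result = PyPerfectTable_Find(obj_table, key_hash);
    }
    /* ... then fall back to the table on the type; method calls might
       skip the per-object table entirely, as suggested above. */
    if (result == NULL && tp->tp_perfect_table != NULL)
        result = PyPerfectTable_Find(tp->tp_perfect_table, key_hash);
    return result;
}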