On Fri, Apr 13, 2012 at 3:06 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
>
>
> Robert Bradshaw <rober...@gmail.com> wrote:
>
>>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
>><d.s.seljeb...@astro.uio.no> wrote:
>>> Ah, I didn't think about 6-bit or Huffman. Certainly helps.
>>
>>Yeah, we don't want to complicate the ABI too much, but I think
>>something like 8 4-bit common chars and 32 6-bit other chars (or 128
>>8-bit other chars) wouldn't be outrageous. The fact that we only have
>>to encode into a single word makes the algorithm very simple (though
>>the majority of the time we'd spit out pre-encoded literals). We have
>>a version number to play with this as well.
>>
>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>
>>> 1) Let the key (the size_t) spill over to the next specialization
>>> entry if it is too large, and prepend that key with a continuation
>>> code (two size_ts could together say "iii)-d\0\0" on 32-bit systems
>>> with 8-bit encoding, using - as continuation). The key-based caller
>>> will expect a continuation if it knows about the specialization, and
>>> the prepended char will prevent spurious matches against the
>>> overspilled slot.
>>>
>>> We could even use the pointers for part of the continuation...
>>>
>>> 2) Separate the char* format strings from the keys, i.e. this memory
>>> layout:
>>>
>>> Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>>>
>>> Where nslots is larger than nspecs if there are continuations.
>>>
>>> OK, this is getting close to my original proposal, but the difference
>>> is the continuation char, so that if you expect a short signature,
>>> you can safely scan every slot with no branching and no null-checking
>>> necessary.
>>
>>I don't think we need nslots (though it might be interesting). My
>>thought is that once you start futzing with variable-length keys, you
>>might as well just compare char*s.
>
> This is where we disagree. If you are the caller, you know at compile
> time how much you want to match; I think comparing 2 or 3 size_ts with
> no looping is a lot better (a fully unrolled, 64-bit-per-instruction
> strcmp with one of the operands known to the compiler...).

Doesn't the compiler unroll strcmp much like this for a known operand?
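For concreteness, here's roughly what I'm picturing on the caller side
when the wanted signature fits in a single key word (just a sketch; all
the names are invented, and the byte-packing "encoding" is only a
stand-in for whatever 4-bit/6-bit scheme we settle on):

    /* Sketch only: single-word-key lookup.  Names are made up. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t key;              /* encoded signature */
        void   (*funcptr)(void);
    } spec_entry;

    /* Pack up to 8 signature chars into one 64-bit key (byte-order
       dependent, but consistent within a build). */
    static uint64_t encode_sig(const char *sig)
    {
        uint64_t key = 0;
        size_t n = strlen(sig);
        memcpy(&key, sig, n < 8 ? n : 8);
        return key;
    }

    /* The caller knows its signature at compile time (or computes the
       key once at module init), so each candidate costs a single word
       compare, not a character loop. */
    static void (*find_spec(const spec_entry *specs, int nspecs,
                            uint64_t want))(void)
    {
        int i;
        for (i = 0; i < nspecs; i++)
            if (specs[i].key == want)
                return specs[i].funcptr;
        return 0;
    }

At the call site that would be find_spec(specs, nspecs,
encode_sig("iii)-d")); whether gcc compiles the equivalent strcmp
against a literal down to the same thing is worth checking.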
>>If one is concerned about memory, one could force the sigcharptr to be
>>aligned, and then the "keys" could be either sigcharptr or key
>>depending on whether the least significant bit was set. One could
>>easily scan for/switch on a key, and scanning for a char* would be
>>almost as easy (just don't dereference if the lsb is set).
>>
>>I don't see us being memory constrained, so
>>
>>(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>>
>>seems fine to me even if only one of key/sigcharptr is ever used per
>>spec. Null-terminating the specs would work fine as well (one less
>>thing to keep track of during iteration).
>
> Well, can't one always use more L1 cache, or is that not a concern? If
> you have 5-6 different routines calling each other using this
> mechanism, each with multiple specializations, those unused slots
> translate to many cache lines wasted.
>
> I don't think it is that important; I just think that how pretty the C
> struct declaration ends up looking should not be a concern at all,
> when the whole point of this is speed anyway. You can always just use
> a throwaway struct declaration and a cast to get whatever layout you
> need. If the 'padding' leads to less branching then fine, but I don't
> see that it helps in any way.

I was more concerned about guaranteeing each char* was aligned.

> To refine my proposal a bit, we have a list of variable-size entries,
>
> (keydata, keydata, ..., funcptr)
>
> where each keydata and the ptr is 64 bits on all platforms (see
> below); each entry must have a total length that is a multiple of 128
> bits (so that one can safely scan for a signature in 128-bit
> increments in the data *without* parsing or branching, you'll never
> hit a pointer), and each key but the first starts with a 'dash'.

Ah, OK, similar to UTF-8. Yes, I like this idea.
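In other words, something like the sketch below (names invented, and
I'm assuming any zero pad word goes before the funcptr so that a
128-bit-stride scan never lands on a pointer):

    #include <stdint.h>

    /* Sketch of the refined layout: a flat array of 64-bit words.
       Each entry is (key, continuation keys..., funcptr), zero-padded
       to a multiple of 128 bits with the funcptr as the last word, and
       every continuation key starts with '-'.  All names invented. */
    typedef union {
        uint64_t key;
        void   (*funcptr)(void);
    } spec_word;

    typedef struct {
        uint16_t  version;
        uint16_t  nwords;      /* number of 64-bit words that follow */
        spec_word words[];     /* entries start at even word indices */
    } spec_table;

    /* A caller whose signature spans exactly two key words (the second
       beginning with '-'): its entry is key, key, zero pad, funcptr,
       so the funcptr sits three words past the match.  Two word
       compares per 128-bit step, no parsing of entry boundaries. */
    static void (*lookup2(const spec_table *t, uint64_t k0, uint64_t k1))(void)
    {
        uint32_t i;
        for (i = 0; i + 3 < t->nwords; i += 2)
            if (t->words[i].key == k0 && t->words[i + 1].key == k1)
                return t->words[i + 3].funcptr;
        return 0;
    }

The single-word case is the same loop with one compare and the funcptr
one word past the key.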
> Signature strings are either kept separate, or even parsed/decoded
> from the keys. We really only care about speed when you have compiled
> or JITed code for the case; decoding should be fine otherwise.

True.

> BTW, won't the Cython-generated C code be a horrible mess if we use
> size_t rather than insisting on int64_t? (OK, those need some ifdefs
> for various compilers, but still seem cleaner than operating with
> 32-bit and 64-bit keys, and stdint.h is gaining ground.)

Sure, we could require 64-bit keys (and pointer slots).

On Fri, Apr 13, 2012 at 3:22 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
>>> I am really lost here. Why is any of this complicated encoding stuff
>>> better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews) by doing a single-word
>>> comparison. And best of all, you don't have to think hard to make
>>> sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>> expressive! You can still do binary search if you want, etc. Is the
>>> problem just that interning requires a runtime calculation? Because I
>>> feel like C users (like numpy) will want to compute these compressed
>>> codes at module-init anyway, and those of us with a fancy compiler
>>> capable of computing them ahead of time (like Cython) can instruct
>>> that fancy compiler to compute them at module-init time just as
>>> easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.
>>
>>It also requires coordination via a common repository, but I suppose
>>one would just stick a set in some standard module (or leverage
>>Python's interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. OK, nothing
> works with that at the moment, but it is on the roadmap for Python and
> we should not make it worse.
>
> You basically *need* a thread-safe store separate from any Python
> interpreter, though pythread.h does not rely on the interpreter state,
> which helps.

I didn't know about the push for multiple interpreter states, but yeah,
that makes things much more painful.

> 2) You end up with the known comparison values in read-write memory
> segments rather than read-only segments, which is probably worse on
> multicore systems?

Yeah, this is the kind of stuff I was vaguely worried about when I
wrote "Not being able to compare against compile-time constants may
thwart some optimization opportunities." I don't know what the impact
is, but it's worth trying to measure and take into account.

> I really think that anything we can do to make this near-C-speed
> should be done; none of the proposals are *that* complicated.
>
> Using keys, NumPy can choose in the C code to be slower but more
> readable; but using interned strings forces Cython to be slower, with
> no way of choosing to go faster. (To the degree that it has an effect;
> none of these claims have been checked.)

Yep, agreed.

- Robert
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel