On Fri, Apr 13, 2012 at 3:06 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
>
>
> Robert Bradshaw <rober...@gmail.com> wrote:
>
>>On Fri, Apr 13, 2012 at 1:27 PM, Dag Sverre Seljebotn
>><d.s.seljeb...@astro.uio.no> wrote:
>>> Ah, I didn't think about 6-bit or Huffman. Certainly helps.
>>
>>Yeah, we don't want to complicate the ABI too much, but I think
>>something like 8 4-bit common chars and 32 6-bit other chars (or 128
>>8-bit other chars) wouldn't be outrageous. The fact that we only have
>>to encode into a single word makes the algorithm very simple (though
>>the majority of the time we'd spit out pre-encoded literals). We have
>>a version number to play with this as well.
>>
>>> I'm almost +1 on your proposal now, but a couple more ideas:
>>>
>>> 1) Let the key (the size_t) spill over to the next specialization
>>> entry if it is too large, and prepend that key with a continuation
>>> code (two size_ts could together say "iii)-d\0\0" on 32-bit systems
>>> with 8-bit encoding, using - as continuation). The key-based caller
>>> will expect a continuation if it knows about the specialization, and
>>> the prepended char will prevent spurious matches against the
>>> overspilled slot.
>>>
>>> We could even use the pointers for part of the continuation...
>>>
>>> 2) Separate the char* format strings from the keys, i.e. this memory
>>> layout:
>>>
>>> Version,nslots,nspecs,funcptr,key,funcptr,key,...,sigcharptr,sigcharptr...
>>>
>>> Where nslots is larger than nspecs if there are continuations.
>>>
>>> OK, this is getting close to my original proposal, but the difference
>>> is the continuation char, so that if you expect a short signature,
>>> you can safely scan every slot with no branching and no null-checking
>>> necessary.
>>
>>I don't think we need nslots (though it might be interesting). My
>>thought is that once you start futzing with variable-length keys, you
>>might as well just compare char*s.
>
> This is where we disagree. If you are the caller, you know at compile
> time how much you want to match; I think comparing 2 or 3 size_ts with
> no looping is a lot better (a fully unrolled, 64-bit-per-instruction
> strcmp with one of the operands known to the compiler...).

Doesn't the compiler unroll strcmp much like this for a known operand?
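For concreteness, here's roughly what I'm picturing on the caller side
when the wanted signature fits in a single key word (just a sketch; all
the names are invented, and the byte-packing "encoding" is only a
stand-in for whatever 4-bit/6-bit scheme we settle on):

    /* Sketch only: single-word-key lookup.  Names are made up. */
    #include <stdint.h>
    #include <string.h>

    typedef struct {
        uint64_t key;              /* encoded signature */
        void   (*funcptr)(void);
    } spec_entry;

    /* Pack up to 8 signature chars into one 64-bit key (byte-order
       dependent, but consistent within a build). */
    static uint64_t encode_sig(const char *sig)
    {
        uint64_t key = 0;
        size_t n = strlen(sig);
        memcpy(&key, sig, n < 8 ? n : 8);
        return key;
    }

    /* The caller knows its signature at compile time (or computes the
       key once at module init), so each candidate costs a single word
       compare, not a character loop. */
    static void (*find_spec(const spec_entry *specs, int nspecs,
                            uint64_t want))(void)
    {
        int i;
        for (i = 0; i < nspecs; i++)
            if (specs[i].key == want)
                return specs[i].funcptr;
        return 0;
    }

At the call site that would be find_spec(specs, nspecs,
encode_sig("iii)-d")); whether gcc compiles the equivalent strcmp
against a literal down to the same thing is worth checking.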
>>If one is concerned about memory, one could force the sigcharptr to be
>>aligned, and then the "keys" could be either sigcharptr or key
>>depending on whether the least significant bit was set. One could
>>easily scan for/switch on a key, and scanning for a char* would be
>>almost as easy (just don't dereference if the lsb is set).
>>
>>I don't see us being memory constrained, so
>>
>>(version,nspecs,futureuse),(key,sigcharptr,funcptr)*,optionalsigchardata*
>>
>>seems fine to me even if only one of key/sigcharptr is ever used per
>>spec. Null-terminating the specs would work fine as well (one less
>>thing to keep track of during iteration).
>
> Well, can't one always use more L1 cache, or is that not a concern? If
> you have 5-6 different routines calling each other using this
> mechanism, each with multiple specializations, those unused slots
> translate to many cache lines wasted.
>
> I don't think it is that important; I just think that how pretty the C
> struct declaration ends up looking should not be a concern at all,
> when the whole point of this is speed anyway. You can always just use
> a throwaway struct declaration and a cast to get whatever layout you
> need. If the 'padding' leads to less branching then fine, but I don't
> see that it helps in any way.

I was more concerned about guaranteeing each char* was aligned.

> To refine my proposal a bit, we have a list of variable-size entries,
>
> (keydata, keydata, ..., funcptr)
>
> where each keydata and the ptr is 64 bits on all platforms (see
> below); each entry must have a total length that is a multiple of 128
> bits (so that one can safely scan for a signature in 128-bit
> increments in the data *without* parsing or branching, you'll never
> hit a pointer), and each key but the first starts with a 'dash'.

Ah, OK, similar to UTF-8. Yes, I like this idea.
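In other words, something like the sketch below (names invented, and
I'm assuming any zero pad word goes before the funcptr so that a
128-bit-stride scan never lands on a pointer):

    #include <stdint.h>

    /* Sketch of the refined layout: a flat array of 64-bit words.
       Each entry is (key, continuation keys..., funcptr), zero-padded
       to a multiple of 128 bits with the funcptr as the last word, and
       every continuation key starts with '-'.  All names invented. */
    typedef union {
        uint64_t key;
        void   (*funcptr)(void);
    } spec_word;

    typedef struct {
        uint16_t  version;
        uint16_t  nwords;      /* number of 64-bit words that follow */
        spec_word words[];     /* entries start at even word indices */
    } spec_table;

    /* A caller whose signature spans exactly two key words (the second
       beginning with '-'): its entry is key, key, zero pad, funcptr,
       so the funcptr sits three words past the match.  Two word
       compares per 128-bit step, no parsing of entry boundaries. */
    static void (*lookup2(const spec_table *t, uint64_t k0, uint64_t k1))(void)
    {
        uint32_t i;
        for (i = 0; i + 3 < t->nwords; i += 2)
            if (t->words[i].key == k0 && t->words[i + 1].key == k1)
                return t->words[i + 3].funcptr;
        return 0;
    }

The single-word case is the same loop with one compare and the funcptr
one word past the key.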
> Signature strings are either kept separate, or even parsed/decoded
> from the keys. We really only care about speed when you have compiled
> or JITed code for the case; decoding should be fine otherwise.

True.

> BTW, won't the Cython-generated C code be a horrible mess if we use
> size_t rather than insisting on int64_t? (OK, those need some ifdefs
> for various compilers, but still seem cleaner than operating with
> 32-bit and 64-bit keys, and stdint.h is gaining ground.)

Sure, we could require 64-bit keys (and pointer slots).

On Fri, Apr 13, 2012 at 3:22 PM, Dag Sverre Seljebotn
<d.s.seljeb...@astro.uio.no> wrote:
>>> I am really lost here. Why is any of this complicated encoding stuff
>>> better than interning? Interning takes one line of code, is
>>> incredibly cheap (one dict lookup per call site and function
>>> definition), and it lets you check any possible signature (even
>>> complicated ones involving memoryviews) by doing a single-word
>>> comparison. And best of all, you don't have to think hard to make
>>> sure you got the encoding right. ;-)
>>>
>>> On a 32-bit system, pointers are smaller than a size_t, but more
>>> expressive! You can still do binary search if you want, etc. Is the
>>> problem just that interning requires a runtime calculation? Because I
>>> feel like C users (like numpy) will want to compute these compressed
>>> codes at module-init anyway, and those of us with a fancy compiler
>>> capable of computing them ahead of time (like Cython) can instruct
>>> that fancy compiler to compute them at module-init time just as
>>> easily?
>>
>>Good question.
>>
>>The primary disadvantage of interning that I see is memory locality. I
>>suppose if all the C-level caches of interned values were co-located,
>>this may not be as big of an issue. Not being able to compare against
>>compile-time constants may thwart some optimization opportunities, but
>>that's less clear.
>>
>>It also requires coordination via a common repository, but I suppose
>>one would just stick a set in some standard module (or leverage
>>Python's interning).
>
> More problems:
>
> 1) It doesn't work well with multiple interpreter states. OK, nothing
> works with that at the moment, but it is on the roadmap for Python and
> we should not make it worse.
>
> You basically *need* a thread-safe store separate from any Python
> interpreter, though pythread.h does not rely on the interpreter state,
> which helps.

I didn't know about the push for multiple interpreter states, but yeah,
that makes things much more painful.

> 2) You end up with the known comparison values in read-write memory
> segments rather than read-only segments, which is probably worse on
> multicore systems?

Yeah, this is the kind of stuff I was vaguely worried about when I
wrote "Not being able to compare against compile-time constants may
thwart some optimization opportunities." I don't know what the impact
is, but it's worth trying to measure and take into account.

> I really think that anything we can do to make this near-C-speed
> should be done; none of the proposals are *that* complicated.
>
> Using keys, NumPy can choose in the C code to be slower but more
> readable; but using interned strings forces Cython to be slower, with
> no way of choosing to go faster. (To the degree that it has an effect;
> none of these claims have been checked.)

Yep, agreed.

- Robert
_______________________________________________
cython-devel mailing list
cython-devel@python.org
http://mail.python.org/mailman/listinfo/cython-devel