------- Comment #4 from zsojka at seznam dot cz  2009-07-17 11:03 -------
> The zero extension is done to avoid partial register stalls.

I am sorry, but is this explanation saying that the generated code is supposedly
the fastest, and that the benchmark shows a different result only because of some
"undocumented/unlucky" conditions? (In that case this task could probably be
closed, since there would be no deterministic way to improve the generated code.)
Or are you saying "the code responsible for eliminating partial register stalls
does a bad job here, because when using only 'ah' and 'eax' there is no _false_
register dependency"?

I wasn't sure whether this has anything to do with "partial register stall
elimination", because the following very similar (and functionally identical)
code:
------------------------------------------------
#include <stdint.h>

uint8_t data[16];

static __attribute__((noinline)) void bar(unsigned i)
{
        unsigned j;
        for (j = 0; j < 16; j++)
                data[j] = (i + j) >> 8;
}
------------------------------------------------

Is compiled as:
------------------------------------------------
bar:
.LFB12:
        .cfi_startproc
        movl    %edi, %eax
        shrl    $8, %eax
        movb    %al, data(%rip)
        leal    1(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+1(%rip)
        leal    2(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+2(%rip)
        leal    3(%rdi), %eax
        shrl    $8, %eax
        movb    %al, data+3(%rip)
        leal    4(%rdi), %eax
...
------------------------------------------------
There is no "partial register stall elimination" here; the only difference is
"shr" instead of "movzx".

So I thought that:
- in the version with "mask & 0xFF00", the compiler recognizes that only the
second byte is masked out and then shifted right by 8 bits, so "ah" can be moved
to "al" (or rather to the whole eax)
- in the version without the "mask", the access is not recognized as reading
only the second byte, so the compiler just emits a "shift right" of the working
register


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772
