------- Comment #4 from zsojka at seznam dot cz 2009-07-17 11:03 -------
> The zero extension is done to avoid partial register stalls.
I am sorry, but does this mean the generated code is supposedly the fastest, and the benchmark shows a different result only because of some "undocumented/unlucky" conditions (so this report could be closed, since there is no way to deterministically improve the generated code)? Or are you saying "the code responsible for eliminating partial register stalls does a bad job here, because when using only 'ah' and 'eax' there is no _false_ register dependency"?

I wasn't sure whether this has anything to do with "partial register stall elimination", because the following, very similar (and functionally identical) code:

------------------------------------------------
uint8_t data[16];

static __attribute__((noinline)) void bar(unsigned i)
{
    unsigned j;
    for (j = 0; j < 16; j++)
        data[j] = (i + j) >> 8;
}
------------------------------------------------

is compiled as:

------------------------------------------------
bar:
.LFB12:
	.cfi_startproc
	movl	%edi, %eax
	shrl	$8, %eax
	movb	%al, data(%rip)
	leal	1(%rdi), %eax
	shrl	$8, %eax
	movb	%al, data+1(%rip)
	leal	2(%rdi), %eax
	shrl	$8, %eax
	movb	%al, data+2(%rip)
	leal	3(%rdi), %eax
	shrl	$8, %eax
	movb	%al, data+3(%rip)
	leal	4(%rdi), %eax
	...
------------------------------------------------

There is no "partial register stall elimination" here; the only difference is "shr" instead of "movzx". So I thought that:
- the version with "mask & 0xFF00" is deciphered as "only the second byte is masked out and then shifted right by 8 bits, so 'ah' can be moved to 'al' (resp. the whole eax)";
- the version without the "mask" is not deciphered as reading only the second byte, so a plain "shift right" of the working register is emitted.

-- 
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=40772