Daniel Marschall <daniel-marsch...@viathinksoft.de> writes:

> As I was optimizing my program, I found a few things which looked odd
> to me in the assembler code.

Thanks.  It's often best to report missed optimizations at
http://gcc.gnu.org/bugzilla/ .  They will tend to be forgotten on the
mailing list.

> I am on an AMD x64_32 box running Debian Squeeze, GCC: (Debian
> 4.4.5-8) 4.4.5.

Note that the current GCC release is 4.7.0.

> #ifdef addcast
>     // Contains a cast to "unsigned long long int", which was not done by "-O3"
>     // This cast makes the output 1 OP code shorter
>     // imulq %rdi, %rdx # tmp80, tmp81
>     // addq %rdx, %rcx # tmp81, c
>     c += (unsigned long long int)a[idx_a] * a[idx_b];
> #else
>     // Using "-O3", it produces 1 OP code which could be optimized away
>     // imull %edi, %edx # tmp80, tmp81 <-- the compiler should use imulq instead of imull
>     // movslq %edx,%rdx # tmp81, tmp82 <-- not necessary... BETTER: optimize away using imulq !
>     // addq %rdx, %rcx # tmp82, c
>     c += a[idx_a] * a[idx_b];
> #endif

This cast changes the meaning of the code, so it's not surprising that
you see different assembler instructions.  The first case above will
do the multiplication in the type "unsigned long long".  In the second
case the "unsigned char" values are zero-extended to int, and the
multiplication is done in the type "int".  Then the "int" result is
sign-extended to "unsigned long long" for the addition.

In this case it's true that the compiler could convert the code as you
suggest, based on the knowledge that the int values are always in the
range 0 to 255.  However, it's not clear to me that using imulq would
be better.  My copy of the Intel optimization manual suggests that
imull has slightly lower latency than imulq, so I think that in many
cases imull would be preferred.

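To make the difference between the two expressions concrete, here is a
minimal sketch in C.  The function names are hypothetical and exist
only for this illustration; they are not taken from the original
program.  It assumes nothing beyond standard C: the uncast product of
two unsigned char values has type int, while the cast version has type
unsigned long long, even though both yield the same value for operands
in the range 0 to 255.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers, named for this illustration only. */

/* Uncast form: both unsigned char operands are promoted to int and the
   multiplication is done in int.  The int result is converted to
   unsigned long long only when it is returned. */
unsigned long long product_in_int(unsigned char x, unsigned char y)
{
    return x * y;
}

/* Cast form: x is converted to unsigned long long first, so y is
   converted as well and the multiplication is done in that type. */
unsigned long long product_in_ull(unsigned char x, unsigned char y)
{
    return (unsigned long long)x * y;
}

/* Size of the type in which the uncast multiplication is carried out.
   The operand of sizeof is not evaluated. */
size_t size_of_uncast_product(void)
{
    unsigned char x = 0, y = 0;
    return sizeof(x * y);
}

/* Size of the type in which the cast multiplication is carried out. */
size_t size_of_cast_product(void)
{
    unsigned char x = 0, y = 0;
    return sizeof((unsigned long long)x * y);
}

/* Small self-check: same values, different expression types. */
void check_products(void)
{
    assert(product_in_int(255, 255) == product_in_ull(255, 255));
    assert(size_of_uncast_product() == sizeof(int));
    assert(size_of_cast_product() == sizeof(unsigned long long));
}
```

Because the largest possible product, 255 * 255 = 65025, fits in an
int, the two forms cannot produce different values here.  Only the
type of the intermediate result differs, which is what drives the
choice between a 32-bit and a 64-bit multiply.
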
> Compiling the following program:
>
> #include <stdio.h>
> #include <strings.h>
> int main(void) {
>     volatile unsigned char a = 4;
>     volatile unsigned char b = 6;
>     volatile unsigned long long int c = a * b;
>     return c;
> }
>
> produces:
>
>         .file   "main.c"
>         .text
>         .p2align 4,,15
> .globl main
>         .type   main, @function
> main:
> .LFB16:
>         .cfi_startproc
>         .cfi_personality 0x3,__gxx_personality_v0
>         movb    $4, -1(%rsp)
>         movb    $6, -2(%rsp)
>         movzbl  -1(%rsp), %edx
>         movzbl  -2(%rsp), %eax
>         movzbl  %dl, %edx
>         movzbl  %al, %eax
>         imull   %edx, %eax
>         cltq
>         movq    %rax, -16(%rsp)   # REDUNDANT??
>         movq    -16(%rsp), %rax   # REDUNDANT??
>         ret
>         .cfi_endproc
> .LFE16:
>         .size   main, .-main
>         .ident  "GCC: (Debian 4.4.5-8) 4.4.5"
>         .section        .note.GNU-stack,"",@progbits
>
> AFAIK, the two movq statements are redundant. What do they do? They
> just do rax=rsp[-16] and rsp[-16]=rax . Or am I wrong?

Those movq instructions exist because you declared c as volatile.  A
volatile local variable must live on the stack.  The first instruction
stores the value into the local variable c.  The second retrieves the
value for the return statement.

In general, uses of volatile variables are not optimized.  That is
intentional and based on the definition of volatile in the language
standard.  So the bar is pretty high for arguing that an optimization
is missing for a volatile variable.

Ian
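
The effect of volatile described above can be checked by compiling two
otherwise identical functions and comparing the assembler output, for
example with "gcc -O2 -S".  The sketch below is only an illustration:
the function names are hypothetical, and the exact instructions depend
on the compiler version and target.  The expectation, stated as an
assumption rather than a guarantee, is that the volatile version keeps
its stores and loads while the plain version is folded to a constant.

```c
/* Hypothetical function names, for illustration only. */

/* Every read and write of a, b and c is an access to a volatile
   object, so the compiler has to perform each one.  The store to c and
   the reload of c for the return value both remain in the output. */
int with_volatile(void)
{
    volatile unsigned char a = 4;
    volatile unsigned char b = 6;
    volatile unsigned long long c = a * b;
    return (int)c;
}

/* Without volatile the optimizer is free to keep the values in
   registers or to compute the result at compile time. */
int without_volatile(void)
{
    unsigned char a = 4;
    unsigned char b = 6;
    unsigned long long c = a * b;
    return (int)c;
}
```

Both functions return 24; only the generated code should differ.
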