Useless conditional branches
It looks like gcc sometimes produces "useless" conditional branches. I've found code like this: xor%edx,%edx ; code with no effect on edx (see full code below) test %edx,%edx jne The branch on the last line is never taken. Why does gcc generate such code sequences? Is this patched at runtime, or something? Am I missing something obvious here? I append the function's complete code below. There is another suspicious branch at 0xb31cd8 (never taken, for less obvious reasons---edx is never zero at that point). I have found hundreds such occurrences across the CPU2006 suite. Does anybody have any idea why this happens? Is there any specific optimization to enable or disable to avoid such dead edges? Thanks in advance for any remark/idea/... This code is from 416.gamess (from SPEC CPU2006), function "formf", compiled with "gcc version 4.4.1 (Ubuntu 4.4.1-4ubuntu9)" (from a stock ubuntu 9.10), with options "-g -O3 -march=native -fno-optimize-sibling-calls". 416.gamess is compiled with gfortran, but the same thing happens with C or C++ programs. The same also happens at lower optimization levels (-01), but less frequently. uname -m gives "x86_64", and /proc/cpuinfo contains: vendor_id : GenuineIntel cpu family : 6 model : 23 model name : Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz [...] flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm ida tpr_shadow vnmi flexpriority Here is an objdump disassembly of the code, broken into basic-blocks: 00b31c60 : b31c60: push %rbx b31c61: xor%edx,%edx b31c63: mov%rsi,%rbx b31c66: mov$0x2,%r9d b31c6c: mov$0xa,%r8d b31c72: mov$0x6,%esi b31c77: mov$0xe,%ecx b31c7c: test %edx,%edx b31c7e: jneb31cda b31c80: mov$0x3ff8,%r11 b31c8a: mov$0xbfe0,%r10 b31c94: mov%r11,(%rdi) b31c97: movq $0x0,0x40(%rdi) b31c9f: movq $0x0,0x20(%rdi) b31ca7: mov%r10,0x60(%rdi) b31cab: xor%eax,%eax b31cad: nopl (%rax) b31cb0: inc%edx b31cb2: movq $0x0,0x10(%rdi,%rax,8) b31cbb: movq $0x0,0x50(%rdi,%rax,8) b31cc4: movq $0x0,0x30(%rdi,%rax,8) b31ccd: movq $0x0,0x70(%rdi,%rax,8) b31cd6: test %edx,%edx b31cd8: je b31c80 b31cda: cmp$0x2,%edx b31cdd: jneb31d18 b31cdf: mov$0xbfe0,%r11 b31ce9: mov$0xbfe0,%r10 b31cf3: mov%r11,(%rdi,%r9,8) b31cf7: mov$0x2,%eax b31cfc: movq $0x0,(%rdi,%r8,8) b31d04: movq $0x0,(%rdi,%rsi,8) b31d0c: mov%r10,(%rdi,%rcx,8) b31d10: jmpb31cb0 b31d12: nopw 0x0(%rax,%rax,1) b31d18: movslq %edx,%rax b31d1b: cmp$0x1,%edx b31d1e: movq $0x0,(%rdi,%rax,8) b31d26: movq $0x0,0x40(%rdi,%rax,8) b31d2f: movq $0x0,0x20(%rdi,%rax,8) b31d38: movq $0x0,0x60(%rdi,%rax,8) b31d41: jneb31cb0 b31d47: movq $0x0,0x50(%rdi,%rax,8) b31d50: movq $0x0,0x30(%rdi,%rax,8) b31d59: mov$0x3fe0,%r9 b31d63: mov$0xbff8,%r8 b31d6d: mov%r9,0x10(%rdi,%rax,8) b31d72: mov%r8,0x70(%rdi,%rax,8) b31d77: mov$0xd0f838,%edx b31d7c: mov$0xd0f7b0,%esi b31d81: mov%rbx,%rdi b31d84: xor%eax,%eax b31d86: callq 977b20 b31d8b: mov$0x3fe0,%rsi b31d95: mov$0x3fe0,%rcx b31d9f: mov%rsi,(%rbx) b31da2: mov%rcx,0x60(%rbx) b31da6: mov$0xbfe0,%rdx b31db0: mov$0xbfe0,%rax b31dba: mov%rdx,0x18(%rbx) b31dbe: mov%rax,0x78(%rbx) b31dc2: pop%rbx b31dc3: retq Let me know if more detail is needed. -- Alain.
Re: Useless conditional branches
Andrew Haley wrote: On 03/02/2010 08:55 AM, Alain Ketterlin wrote: It looks like gcc sometimes produces "useless" conditional branches. I've found code like this: xor%edx,%edx ; code with no effect on edx (see full code below) test %edx,%edx jne The branch on the last line is never taken. Why does gcc generate such code sequences? Is this patched at runtime, or something? Am I missing something obvious here? We really need a test case, with source, that illustrates the problem. When we have that, we can treat is as a missed-optimization bug. Sure. I can provide a list of functions from the SPEC CPU2006 where it happens. Here is the list for 403.gcc, at -01 and -03 (the CPU2006 gcc is based on gcc-3.2, according to SPEC, so probably different from the actual code base). === 403.gcc (-O1) FUN init_builtins 0x43582f BLOCK 0x43582f FUN cpp_finish_options 0x435a39 BLOCK 0x435a52 FUN life_analysis 0x4be8f7 BLOCK 0x4beb79 FUN optimize_mode_switching 0x580b9a BLOCK 0x580ea4 === 403.gcc (-O3) FUN start_function 0x413af0 BLOCK 0x41410e FUN init_builtins 0x442640 BLOCK 0x442640 FUN cpp_finish_options 0x442ec0 BLOCK 0x442f28 FUN can_combine_p 0x4770f0 BLOCK 0x477196 FUN gen_rtx 0x4bf440 BLOCK 0x4bf5a8 FUN expand_shift 0x4ca5a0 BLOCK 0x4ca6ce FUN life_analysis 0x4edbc0 BLOCK 0x4ee03d FUN htab_create 0x6a1940 BLOCK 0x6a1940 FUN htab_expand 0x6a1aa0 BLOCK 0x6a1aa0 FUN htab_try_create 0x6a1ee0 BLOCK 0x6a1ee0 (the addresses may be meaningless, but if both addresses are equal it means that the problem appears on the entry block). I can't guarantee this list is exhaustive. My full list for CPU2006 programs is a bit too long to post on the mailing list (around a thousand lines). I can send it by mail (is a list of function names enough?). I'll try to extract a minimal example if you want and/or have no access to CPU2006. Give me one or two days. -- Alain.
Sub-optimal code
I've reported here recently about gcc producing conditional branches with static outcome. I've finally managed to produce a simple self-contained example. Here it is: int f() { static int LSW=-1; double x=0.987654321; unsigned char ix = *((char *)&x); if(ix==0xdd || ix==0x3f) LSW = 1; else LSW = 0; return 0; } Compiling this with "gcc-4.5 -O3 -c f.c" (20100304 snapshot) on my x86-64 system produces the following code: 0: b8 b8 ff ff ff mov$0xffb8,%eax 5: 3c dd cmp$0xdd,%al 7: 0f 94 c0sete %al ... Ok, not a branch, but still useless code. gcc-4.4 does the same, and gcc-4.3 is slightly less clever (it moves the whole immediate 8 bytes into %rax and does two cmp+je). I first thought the type-cast was responsible. But if you replace the assignment to LSW with a simple return, gcc correctly optimizes the whole function away. Can anybody confirm this? Is it worth a bug report? -- Alain.
Irreducible loops in generated code
I'm working on decompiling x86-64 binary programs, using branches to rebuild a control-flow graph and looking for loops. I've found a significant number of irreducible loops in gcc-produced code (irreducible loops are loops with more than one entry point), especially in -O3 optimized binaries, even when the source code is "well" structured. My experiments are done mainly on the SPEC CPU-2006 benchmarks. My question is: where do these irreducible loops come from? Which optimization pass leads to irreducible regions? Thanks in advance for any pointer. -- Alain.
Re: Irreducible loops in generated code
On 18/08/2010 12:37, Steven Bosscher wrote: On Wed, Aug 18, 2010 at 11:53 AM, Alain Ketterlin wrote: My question is: where do these irreducible loops come from? Which optimization pass leads to irreducible regions? Thanks in advance for any pointer. Questions right back at you: What compiler version are you using? Got a specific exampe (test case)? Sure, sorry, forgot that: it's gcc 4.4.3 (the one that comes with Ubuntu Lucid). My test cases are SPEC programs. I haven't been able to extract a simple example, it usually happens on large functions with heavily nested loops. In older releases of GCC4, jump threading sometimes resulted in irreducible regions, but more recent GCCs carefully try to avoid them. Excellent news :-) OK, I'll try the new version and report back what I find. Thanks for the help. -- Alain.
jump after return
When compiling at -O2 and -O3 optimization, I've found several instances of routines ending with: retq jmp with no branch pointing to the jmp instruction. Is there any specific reason for such code? Is there a way to avoid this? This is gcc 4.4.5, as installed on Ubuntu 10.10, compiling the SPEC OMP and CPU-2006 benchmarks. Thanks in advance. -- Alain.