libgcc2 provides "double-word" division as __udivmoddi4(); on AMD64, where a word is 64 bits, it is built as __udivmodti4(). The following part of its source
| UWtype d0, d1, n0, n1, n2;
| UWtype b, bm;
...
| count_leading_zeros (bm, d1);
| if (bm == 0)
...
| else
|   {
|     UWtype m1, m0;
|     /* Normalize.  */
|
|     b = W_TYPE_SIZE - bm;
|
|     d1 = (d1 << bm) | (d0 >> b);
|     d0 = d0 << bm;
|     n2 = n1 >> b;
|     n1 = (n1 << bm) | (n0 >> b);
|     n0 = n0 << bm;

compiles for AMD64 to this rather clumsy code:

|  b0:   4c 0f bd da             bsr    %rdx,%r11
|  b4:   49 83 f3 3f             xor    $0x3f,%r11
|  b8:   45 85 db                test   %r11d,%r11d
|  bb:   75 33                   jne    f0 <__udivmodti4+0xf0>
...
|  f0:   49 63 c3                movslq %r11d,%rax
|  f3:   bd 40 00 00 00          mov    $0x40,%ebp
|  f8:   44 89 d9                mov    %r11d,%ecx
|  fb:   4d 89 ec                mov    %r13,%r12
|  fe:   48 29 c5                sub    %rax,%rbp
| 101:   48 d3 e2                shl    %cl,%rdx
| 104:   49 89 f2                mov    %rsi,%r10
| 107:   48 89 f8                mov    %rdi,%rax
| 10a:   89 e9                   mov    %ebp,%ecx
| 10c:   44 89 db                mov    %r11d,%ebx
| 10f:   49 d3 ec                shr    %cl,%r12
| 112:   44 89 d9                mov    %r11d,%ecx
| 115:   49 d3 e5                shl    %cl,%r13
| 118:   89 e9                   mov    %ebp,%ecx
| 11a:   49 09 d4                or     %rdx,%r12
| 11d:   49 d3 ea                shr    %cl,%r10
| 120:   44 89 d9                mov    %r11d,%ecx
| 123:   48 d3 e6                shl    %cl,%rsi
| 126:   89 e9                   mov    %ebp,%ecx
| 128:   4c 89 d2                mov    %r10,%rdx
| 12b:   48 d3 e8                shr    %cl,%rax
| 12e:   44 89 d9                mov    %r11d,%ecx
| 131:   48 09 c6                or     %rax,%rsi
| 134:   48 d3 e7                shl    %cl,%rdi
| 137:   48 89 f0                mov    %rsi,%rax
| 13a:   49 89 f9                mov    %rdi,%r9

Why does the optimizer not recognize this pattern of alternating shifts and generate MUCH shorter and, of course, faster code like the following? (I use the variable names from the C source instead of register names here.)

    mov     %r11d, %ecx
    shld    %cl, d0, d1
    xor     n2, n2
    shld    %cl, n1, n2
    shld    %cl, n0, n1

JFTR: the test at b8 is superfluous; the xor at b4 already sets ZF according to whether bm is zero.

regards
Stefan
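
PS: For comparison, here is a minimal stand-alone C reduction of the same normalization step (my own sketch, not the libgcc source; the function name normalize() and the precondition 1 <= bm <= 63 are assumptions). It can be fed to "gcc -O2 -S" to check whether the shift pairs get merged into SHLD or stay as separate shl/shr/or sequences:

    #include <stdint.h>

    /* Same alternating-shift pattern as the "Normalize" block above,
       on plain 64-bit halves.  The caller guarantees 1 <= bm <= 63,
       matching the bm != 0 branch, so neither shift count is 0 or 64. */
    void normalize (uint64_t *d1, uint64_t *d0,
                    uint64_t *n2, uint64_t *n1, uint64_t *n0,
                    unsigned bm)
    {
        unsigned b = 64 - bm;

        *d1 = (*d1 << bm) | (*d0 >> b);
        *d0 = *d0 << bm;
        *n2 = *n1 >> b;
        *n1 = (*n1 << bm) | (*n0 >> b);
        *n0 = *n0 << bm;
    }

Each (x << bm) | (y >> (64 - bm)) pair is exactly what a single SHLD with shift count bm computes.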