libgcc2 provides "double-word" division as __udivmoddi4(); on AMD64, where a word is 64 bits, it is built as __udivmodti4(). The following part of its source
| UWtype d0, d1, n0, n1, n2;
| UWtype b, bm;
...
| count_leading_zeros (bm, d1);
| if (bm == 0)
...
| else
|   {
|     UWtype m1, m0;
|     /* Normalize.  */
|
|     b = W_TYPE_SIZE - bm;
|
|     d1 = (d1 << bm) | (d0 >> b);
|     d0 = d0 << bm;
|     n2 = n1 >> b;
|     n1 = (n1 << bm) | (n0 >> b);
|     n0 = n0 << bm;

compiles for AMD64 to this rather clumsy code:

|  b0:   4c 0f bd da             bsr    %rdx,%r11
|  b4:   49 83 f3 3f             xor    $0x3f,%r11
|  b8:   45 85 db                test   %r11d,%r11d
|  bb:   75 33                   jne    f0 <__udivmodti4+0xf0>
...
|  f0:   49 63 c3                movslq %r11d,%rax
|  f3:   bd 40 00 00 00          mov    $0x40,%ebp
|  f8:   44 89 d9                mov    %r11d,%ecx
|  fb:   4d 89 ec                mov    %r13,%r12
|  fe:   48 29 c5                sub    %rax,%rbp
| 101:   48 d3 e2                shl    %cl,%rdx
| 104:   49 89 f2                mov    %rsi,%r10
| 107:   48 89 f8                mov    %rdi,%rax
| 10a:   89 e9                   mov    %ebp,%ecx
| 10c:   44 89 db                mov    %r11d,%ebx
| 10f:   49 d3 ec                shr    %cl,%r12
| 112:   44 89 d9                mov    %r11d,%ecx
| 115:   49 d3 e5                shl    %cl,%r13
| 118:   89 e9                   mov    %ebp,%ecx
| 11a:   49 09 d4                or     %rdx,%r12
| 11d:   49 d3 ea                shr    %cl,%r10
| 120:   44 89 d9                mov    %r11d,%ecx
| 123:   48 d3 e6                shl    %cl,%rsi
| 126:   89 e9                   mov    %ebp,%ecx
| 128:   4c 89 d2                mov    %r10,%rdx
| 12b:   48 d3 e8                shr    %cl,%rax
| 12e:   44 89 d9                mov    %r11d,%ecx
| 131:   48 09 c6                or     %rax,%rsi
| 134:   48 d3 e7                shl    %cl,%rdi
| 137:   48 89 f0                mov    %rsi,%rax
| 13a:   49 89 f9                mov    %rdi,%r9

Why does the optimizer not recognize this pattern of alternating shifts and generate MUCH shorter and, of course, faster code like the following? (I use the variable names from the C source instead of register names here.)

    mov     %r11d, %ecx
    shld    %cl, d0, d1
    xor     n2, n2
    shld    %cl, n1, n2
    shld    %cl, n0, n1

JFTR: the test at b8 is superfluous; the xor at b4 already sets ZF according to whether bm is zero.

regards
Stefan
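
PS: For comparison, here is a minimal stand-alone C reduction of the same normalization step (my own sketch, not the libgcc source; the function name normalize() and the precondition 1 <= bm <= 63 are assumptions). It can be fed to "gcc -O2 -S" to check whether the shift pairs get merged into SHLD or stay as separate shl/shr/or sequences:

    #include <stdint.h>

    /* Same alternating-shift pattern as the "Normalize" block above,
       on plain 64-bit halves.  The caller guarantees 1 <= bm <= 63,
       matching the bm != 0 branch, so neither shift count is 0 or 64. */
    void normalize (uint64_t *d1, uint64_t *d0,
                    uint64_t *n2, uint64_t *n1, uint64_t *n0,
                    unsigned bm)
    {
        unsigned b = 64 - bm;

        *d1 = (*d1 << bm) | (*d0 >> b);
        *d0 = *d0 << bm;
        *n2 = *n1 >> b;
        *n1 = (*n1 << bm) | (*n0 >> b);
        *n0 = *n0 << bm;
    }

Each (x << bm) | (y >> (64 - bm)) pair is exactly what a single SHLD with shift count bm computes.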