Gabriel Paubert <[email protected]> wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> .text
>> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
>> 4: R_X86_64_PC32 .rdata
>> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
>> c: R_X86_64_PC32 .rdata
>> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
>> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
>> 18: 66 0f 54 da andpd %xmm2, %xmm3
>> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
>> 20: 76 16 jbe 38 <_trunc+0x38>
>> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
>> 27: 66 0f ef c0 pxor %xmm0, %xmm0
>> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
>> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
>> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
>> 38: c3 retq
>>
>>          .rdata
>>          .align 8
>>  0:  00 00 00 00 00 00 30 43   .LC0:  .quad  0x1.0p52
>>  8:  00 00 00 00 00 00 00 00
>>          .align 16
>> 10:  ff ff ff ff ff ff ff 7f   .LC1:  .quad  ~(-0.0)
>> 18:  00 00 00 00 00 00 00 00          .quad  0.0
>>          .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>> while in the worst case they yield a page fault!
>>
>>
>> Properly optimized, shorter and faster code, using only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
>> accesses and saving at least 16 + 32 bytes, follows:
>>
>> .intel_syntax
>> .text
>>  0: f2 48 0f 2c c0    cvttsd2si rax, xmm0    # rax = trunc(argument)
>>  5: 48 f7 d8          neg       rax
>> #                     jz        .L0          # argument zero?
>>  8: 70 16             jo        .L0          # argument indefinite?
>>                                              # argument overflows 64-bit integer?
>>  a: 48 f7 d8          neg       rax
>>  d: f2 48 0f 2a c8    cvtsi2sd  xmm1, rax    # xmm1 = trunc(argument)
>> 12: 66 0f 73 d0 3f    psrlq     xmm0, 63
>> 17: 66 0f 73 f0 3f    psllq     xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c: 66 0f 56 c1       orpd      xmm0, xmm1   # xmm0 = trunc(argument)
>> 20: c3          .L0:  ret
>>     .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.
Right, I overlooked this; thanks for pointing it out.
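For reference, the sequence above corresponds roughly to the following C.
This is only a sketch (the function name is mine) that leans on the x86-64
behaviour of the conversion: for NaN and |x| >= 2^63, cvttsd2si raises the
invalid exception and returns INT64_MIN, the "integer indefinite" value.

#include <stdint.h>
#include <string.h>

static double trunc_sketch(double x)
{
    int64_t i = (int64_t)x;        /* cvttsd2si: truncate toward zero; NaN and
                                      out-of-range inputs raise the invalid
                                      exception and yield INT64_MIN            */
    if (i == INT64_MIN)            /* neg/jo: "integer indefinite"?            */
        return x;                  /* NaN, Inf, |x| >= 2^63: return unchanged  */

    uint64_t bits, sign;
    memcpy(&bits, &x, sizeof bits);
    sign = bits & 0x8000000000000000ull;  /* psrlq/psllq 63: isolate sign bit  */

    double t = (double)i;                 /* cvtsi2sd                          */
    memcpy(&bits, &t, sizeof bits);
    bits |= sign;                         /* orpd: restores -0.0 for negative
                                             inputs that truncate to zero      */
    memcpy(&t, &bits, sizeof bits);
    return t;
}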
> So using your code may require some option (-fast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.
The whole idea behind these implementations is to get rid of loading
floating-point constants just to perform comparisons.
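If a guard is added, it can at least stay in the integer domain instead of
loading a floating-point constant. A sketch (the name is mine), where the
0x433 threshold is the biased exponent of the 0x1.0p52 constant GCC
compares against:

#include <stdint.h>
#include <string.h>

/* Biased exponent >= 0x433 (1075) means |x| >= 2^52, i.e. x is already
 * integral, or x is NaN/Inf (exponent 0x7ff); none of these need to reach
 * cvttsd2si, so the invalid exception can no longer be raised.            */
static int needs_conversion(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return ((bits >> 52) & 0x7ff) < 0x433;
}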
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.
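A conforming trunc() has to satisfy, for instance, the following; the
negative fractional cases are exactly where the orpd with the isolated
sign bit matters:

#include <assert.h>
#include <math.h>

int main(void)
{
    /* C Annex F / IEEE 754: truncation preserves the sign, so -0.0 and
     * every value in (-1.0, 0.0) truncate to -0.0.                      */
    assert(signbit(trunc(-0.0)));
    assert(signbit(trunc(-0.25)) && trunc(-0.25) == 0.0);
    assert(!signbit(trunc(+0.25)) && trunc(+0.25) == 0.0);
    return 0;
}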
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
I don't know how GCC generates the code for builtins or what kind of
templates it uses; my second goal was to minimize register usage.
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them; the resulting branchless code is still
slightly shorter and faster (17 instructions in 84 bytes, instead of
13 instructions plus constants totalling 57 + 32 = 89 bytes):
.code64
.intel_syntax noprefix
.text
 0: 48 b8 00 00 00 00 00 00 30 43   mov       rax, 0x4330000000000000
 a: 66 48 0f 6e d0                  movq      xmm2, rax    # xmm2 = 0x1.0p52 = 4503599627370496.0
 f: 48 b8 00 00 00 00 00 00 f0 3f   mov       rax, 0x3FF0000000000000
19: f2 0f 10 c8                     movsd     xmm1, xmm0   # xmm1 = argument
1d: 66 0f 73 f0 01                  psllq     xmm0, 1
22: 66 0f 73 d0 01                  psrlq     xmm0, 1      # xmm0 = |argument|
27: 66 0f 73 d1 3f                  psrlq     xmm1, 63
2c: 66 0f 73 f1 3f                  psllq     xmm1, 63     # xmm1 = (argument & -0.0) ? -0.0 : +0.0
31: f2 0f 10 d8                     movsd     xmm3, xmm0
35: f2 0f 58 c2                     addsd     xmm0, xmm2   # xmm0 = |argument| + 0x1.0p52
39: f2 0f 5c c2                     subsd     xmm0, xmm2   # xmm0 = |argument| + 0x1.0p52 - 0x1.0p52
                                                           #      = rint(|argument|)
3d: 66 48 0f 6e d0                  movq      xmm2, rax    # xmm2 = 0x1.0p0 = 1.0
42: f2 0f c2 d8 01                  cmpltsd   xmm3, xmm0   # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
47: 66 0f 54 d3                     andpd     xmm2, xmm3   # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
4b: f2 0f 5c c2                     subsd     xmm0, xmm2   # xmm0 = rint(|argument|)
                                                           #      - ((|argument| < rint(|argument|)) ? 1.0 : 0.0)
                                                           #      = trunc(|argument|)
4f: 66 0f 56 c1                     orpd      xmm0, xmm1   # xmm0 = trunc(argument)
53: c3                              ret
.end
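For readers who prefer C over assembler, the same add/subtract-0x1.0p52
technique reads roughly as follows. This is only a sketch: the early return
for |argument| >= 0x1.0p52 (such values are already integral) and the names
are mine, the branch stands in for the cmpltsd/andpd mask for readability,
and the function must not be compiled with -ffast-math, which may fold the
add/subtract pair away.

#include <stdint.h>
#include <string.h>

static double trunc_branchless_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ull;   /* psrlq/psllq 63          */

    double ax;                                      /* |argument|              */
    bits &= 0x7fffffffffffffffull;                  /* psllq/psrlq 1           */
    memcpy(&ax, &bits, sizeof ax);

    if (!(ax < 0x1p52))             /* already integral, NaN or Inf: the       */
        return x;                   /* 0x1.0p52 trick is only exact below 2^52 */

    double r = (ax + 0x1p52) - 0x1p52;              /* r = rint(|argument|)    */
    if (ax < r)                                     /* rounded up?             */
        r -= 1.0;                                   /* then step down to trunc */

    memcpy(&bits, &r, sizeof bits);
    bits |= sign;                                   /* orpd: restore the sign  */
    memcpy(&r, &bits, sizeof bits);
    return r;
}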
regards
Stefan