Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Gabriel Paubert Thu, 05 Aug 2021 06:59:30 -0700

        Hi,

On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
> Gabriel Paubert <paub...@iram.es> wrote:
> 
> 
> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
> >> Hi,
> >> 
> >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
> >> following code (13 instructions using 57 bytes, plus 4 quadwords
> >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
> >> 
> >>                                 .text
> >>    0:   f2 0f 10 15 10 00 00 00 movsd  .LC1(%rip), %xmm2
> >>                         4: R_X86_64_PC32        .rdata
> >>    8:   f2 0f 10 25 00 00 00 00 movsd  .LC0(%rip), %xmm4
> >>                         c: R_X86_64_PC32        .rdata
> >>   10:   66 0f 28 d8             movapd %xmm0, %xmm3
> >>   14:   66 0f 28 c8             movapd %xmm0, %xmm1
> >>   18:   66 0f 54 da             andpd  %xmm2, %xmm3
> >>   1c:   66 0f 2e e3             ucomisd %xmm3, %xmm4
> >>   20:   76 16                   jbe    38 <_trunc+0x38>
> >>   22:   f2 48 0f 2c c0          cvttsd2si %xmm0, %rax
> >>   27:   66 0f ef c0             pxor   %xmm0, %xmm0
> >>   2b:   66 0f 55 d1             andnpd %xmm1, %xmm2
> >>   2f:   f2 48 0f 2a c0          cvtsi2sd %rax, %xmm0
> >>   34:   66 0f 56 c2             orpd   %xmm2, %xmm0
> >>   38:   c3                      retq
> >> 
> >>                                 .rdata
> >>                                 .align 8
> >>    0:   00 00 00 00     .LC0:   .quad  0x1.0p52
> >>         00 00 30 43
> >>         00 00 00 00
> >>         00 00 00 00
> >>                                 .align 16
> >>   10:   ff ff ff ff     .LC1:   .quad  ~(-0.0)
> >>         ff ff ff 7f
> >>   18:   00 00 00 00             .quad  0.0
> >>         00 00 00 00
> >>                                 .end
> >> 
> >> JFTR: in the best case, the memory accesses cost several cycles,
> >>       while in the worst case they yield a page fault!
> >> 
> >> 
> >> Properly optimized, shorter and faster code, using only 9 instructions
> >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
> >> accesses and saving at least 16 + 32 bytes, follows:
> >> 
> >>                               .intel_syntax noprefix
> >>                               .text
> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
> >>    5:   48 f7 d8              neg     rax
> >>                         #     jz      .L0          # argument zero?
> >>    8:   70 16                 jo      .L0          # argument indefinite?
> >>                                                    # argument overflows 64-bit integer?
> >>    a:   48 f7 d8              neg     rax
> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
> >>   20:   c3              .L0:  ret
> >>                               .end
> > 
> > There is one important difference, namely setting the invalid exception
> > flag when the parameter can't be represented in a signed integer.
> 
> Right, I overlooked this fault. Thanks for pointing it out.
> 
> > So using your code may require some option (-ffast-math comes to mind),
> > or you need at least a check on the exponent before cvttsd2si.
> 
> The whole idea behind these implementations is to get rid of loading
> floating-point constants to perform comparisons.


Indeed, but what I had in mind was something along the following lines:

        movq rax,xmm0   # and copy rax to say rcx, if needed later
        shr rax,52      # move sign and exponent to 12 LSBs
        and eax,0x7ff   # mask the sign
        cmp eax,0x434   # value to be checked
        ja return       # exponent too large, we're done (what about NaNs?)
        cvttsd2si rax,xmm0 # safe after exponent check
        cvtsi2sd xmm0,rax  # conversion done

and a bit more to handle the corner cases (essentially preserve the
sign to be correct between -1 and -0.0). But the CPU can (speculatively) 
start the conversions early, so the dependency chain is rather short.

I don't know if it's faster than your new code, but I'm almost sure that
it's shorter. Your new code also has a fairly long dependency chain.
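
In C terms, the exponent check would look roughly like the following. This
is only a sketch: the names trunc_expcheck and f64_to_bits are made up for
illustration, it assumes IEEE 754 binary64 doubles, and it is not what GCC
emits. NaNs and infinities have the exponent field 0x7ff, so they take the
early return and come back unchanged (which also means a signaling NaN
would not raise invalid here).

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a double as its 64-bit pattern (the movq rax, xmm0 step). */
static uint64_t f64_to_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Hypothetical name; a sketch of the exponent-check idea, not GCC's code. */
static double trunc_expcheck(double x)
{
    uint64_t bits = f64_to_bits(x);
    unsigned exponent = (unsigned)(bits >> 52) & 0x7ff; /* drop the sign bit */

    /* A biased exponent above 0x434 means |x| >= 2^54, infinity or NaN:
       nothing to truncate, so return the argument as is. */
    if (exponent > 0x434)
        return x;

    int64_t i = (int64_t)x;   /* cvttsd2si: in range, since |x| < 2^54 */
    double r = (double)i;     /* cvtsi2sd */

    /* Corner case: copy the argument's sign so (-1, -0.0] yields -0.0. */
    uint64_t rbits = f64_to_bits(r) | (bits & 0x8000000000000000ULL);
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```

The sign is merged back with integer operations here only to keep the
sketch free of libm calls; copysign() would do the same job.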

> 
> > The last part of your code then goes to take into account the special
> > case of -0.0, which I most often don't care about (I'd like to have a
> > -fdont-split-hairs-about-the-sign-of-zero option).
> 
> Preserving the sign of -0.0 is explicitly specified in the standard,
> and is cheap, as shown in my code.
> 
> > Potentially generating spurious invalid operation and then carefully
> > taking into account the sign of zero does not seem very consistent.
> > 
> > Apart from this, in your code, after cvttsd2si I'd rather use:
> > mov rcx,rax # make a second copy to a scratch register
> > neg rcx
> > jo .L0
> > cvtsi2sd xmm1,rax
> 
> I don't know how GCC generates the code for builtins, and what kind of
> templates it uses: the second goal was to minimize register usage.
> 

Ok, but on 64 bit using two GPRs would still be reasonable.

> > The reason is latency: in an OoO engine, splitting the two paths is
> > almost always a win.
> > 
> > With your patch:
> > 
> > cvttsd2si-->neg-?->neg-->cvtsi2sd
> >              
> > where the ? means that the following instructions are speculated.  
> > 
> > With an auxiliary register there are two dependency chains:
> > 
> > cvttsd2si-?->cvtsi2sd
> >         |->mov->neg->jump
> 
> Correct; see above: I expect the template(s) for builtins to give
> the register allocator some freedom to split code paths and resolve
> dependency chains.
> 
> > Actually some OoO cores just eliminate register copies using the
> > register renaming mechanism. But even this is probably completely
> > irrelevant in this case, where the latency is dominated by the two
> > conversion instructions.
> 
> Right, the conversions dominate both the original and the code I posted.
> It's easy to get rid of them, with still slightly shorter and faster
> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
> 57 + 32 = 89 bytes):
> 
>                                         .code64
>                                         .intel_syntax noprefix
>                                         .text
>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
>                                                                  #      = rint(|argument|)
>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0

Huh? I see +1.0, -1 would be 0xBFF0000000000000.
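
The encodings are easy to double-check from C. A minimal sketch, assuming
double is IEEE 754 binary64 (ieee754_bits is a name made up for this
example):

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 64-bit pattern of a double (hypothetical helper). */
static uint64_t ieee754_bits(double d)
{
    uint64_t u;
    memcpy(&u, &d, sizeof u);
    return u;
}

/* Expected patterns under IEEE 754 binary64:
 *   ieee754_bits( 1.0)      == 0x3FF0000000000000
 *   ieee754_bits(-1.0)      == 0xBFF0000000000000  (only the sign bit differs)
 *   ieee754_bits(0x1.0p52)  == 0x4330000000000000
 */
```

So the constant loaded into rax is +1.0, which is also what the later
"? 1.0 : 0.0" comments assume; only the "-1.0" comment is off.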

>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>                                                                  #      = trunc(|argument|)
>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
>   53:   c3                              ret
>                                         .end
> 
> regards
> Stefan
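
To make the add/subtract trick in the quoted listing easier to follow, here
is a rough C rendering. It is a sketch under assumptions, not a translation
anyone should rely on: trunc_magic is a made-up name, doubles are assumed
to be IEEE 754 binary64 evaluated in double precision (as with SSE2), and
the volatile is only there to keep the compiler from folding the
add/subtract pair. Adding 2^52 is exact only while |argument| < 2^52, so
the sketch guards larger magnitudes (and NaN) with a branch; that guard is
my addition and is not part of the quoted listing.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name; illustrates the 2^52 add/subtract rounding trick. */
static double trunc_magic(double x)
{
    const uint64_t SIGN = 0x8000000000000000ULL;
    uint64_t bits, sign, rbits;
    double a, r;

    memcpy(&bits, &x, sizeof bits);
    sign = bits & SIGN;              /* psrlq/psllq by 63: keep the sign only */
    bits &= ~SIGN;                   /* psllq/psrlq by 1: |x| */
    memcpy(&a, &bits, sizeof a);

    /* Added guard (assumption): values >= 2^52, infinities and NaNs are
       returned unchanged, because the trick below is not exact for them. */
    if (!(a < 0x1.0p52))
        return x;

    volatile double t = a + 0x1.0p52; /* addsd: fraction bits are rounded away */
    r = t - 0x1.0p52;                 /* subsd: r = |x| rounded to an integer */
    r -= (a < r) ? 1.0 : 0.0;         /* cmpltsd/andpd/subsd: undo rounding up */

    memcpy(&rbits, &r, sizeof rbits);
    rbits |= sign;                    /* orpd: restore the sign, also for -0.0 */
    memcpy(&r, &rbits, sizeof r);
    return r;
}
```

For example, 3.5 rounds to 4.0 in the add/subtract step, the comparison
3.5 < 4.0 holds, and subtracting 1.0 gives 3.0.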

        Regards,
        Gabriel
