Re: Suboptimal code generated for __builtin_trunc on AMD64 without SSE4.1

Stefan Kanthak Fri, 06 Aug 2021 05:46:50 -0700

Gabriel Paubert <paub...@iram.es> wrote:

> Hi,
> 
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paub...@iram.es> wrote:
>> 
>> 
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:


>> >>                               .intel_syntax
>> >>                               .text
>> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>> >>    5:   48 f7 d8              neg     rax
>> >>                         #     jz      .L0          # argument zero?
>> >>    8:   70 16                 jo      .L0          # argument indefinite?
>> >>                                                    # argument overflows 64-bit integer?
>> >>    a:   48 f7 d8              neg     rax
>> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> >>   20:   c3              .L0:  ret
>> >>                               .end
>> > 
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>> 
>> Right, I overlooked this fault. Thanks for pointing out.
>> 
>> > So using your code may require some option (-ffast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>> 
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisons.
> 
> Indeed, but what I had in mind was something along the following lines:
> 
> movq rax,xmm0   # and copy rax to say rcx, if needed later
> shrq rax,52     # move sign and exponent to 12 LSBs 
> andl eax,0x7ff  # mask the sign
> cmpl eax,0x434  # value to be checked
> ja return       # exponent too large, we're done (what about NaNs?)
> cvttsd2si rax,xmm0 # safe after exponent check
> cvtsi2sd xmm0,rax  # conversion done
> 
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).

The sign of -0.0 is the only corner case and already handled in my code.
Both SNAN and QNAN (which have an exponent 0x7ff) are handled and
preserved, as in the code GCC generates as well as my code.

> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.

Correct.
 
> I don't know if it's faster than your new code,

It should be faster.

> I'm almost sure that it's shorter.

"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, the above sequence has but
5+4+5+5+2=21 bytes.

JFTR: better use "add rax,rax; shr rax,53" instead of
      "shr rax,52; and eax,0x7ff" and save 2 bytes.

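The equivalence of the two exponent extractions can be checked with a
small C model. This is only a sketch: the helper names are made up for
illustration, and it assumes that double is IEEE 754 binary64.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a double (assumed to be IEEE 754 binary64). */
static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

/* "shr rax,52; and eax,0x7ff": shift down, then mask off the sign. */
static unsigned exponent_shift_mask(double x)
{
    return (unsigned)(bits_of(x) >> 52) & 0x7ffu;
}

/* "add rax,rax; shr rax,53": doubling pushes the sign out of bit 63. */
static unsigned exponent_add_shift(double x)
{
    uint64_t u = bits_of(x);
    return (unsigned)((u + u) >> 53);
}

/* Both variants must agree on the biased exponent. */
static void check_exponent_variants(void)
{
    const double samples[] = { 0.0, -0.0, 0.5, -1.0, 9007199254740992.0, -1e300 };
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(exponent_shift_mask(samples[i]) == exponent_add_shift(samples[i]));
}
```

Calling check_exponent_variants() exercises both forms on a few sample
values, including -0.0 and 0x1.0p53.
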
Complete properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):

.code64
.intel_syntax noprefix
.equ    BIAS, 1023
.text
        movq    rax, xmm0    # rax = argument
        add     rax, rax     # shift the sign bit out
        shr     rax, 53      # rax = exponent of |argument|
        cmp     eax, BIAS + 53
        jae     .Lexit       # argument indefinite?
                             # |argument| >= 0x1.0p53?
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
        psrlq   xmm0, 63
        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
        orpd    xmm0, xmm1   # xmm0 = trunc(argument)
.Lexit: ret
.end
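
For readers who prefer C, the instruction sequence above can be modelled
roughly as follows. This is a sketch under assumptions, not what GCC
emits: trunc_model and check_trunc_model are hypothetical names, and
double is assumed to be IEEE 754 binary64. The exponent check guarantees
|x| < 0x1.0p53 before the cast, so the conversion to int64_t is well
defined.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

enum { BIAS = 1023 };

/* Hypothetical C model of the integer-only exponent check plus
   cvttsd2si/cvtsi2sd round trip. */
static double trunc_model(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);                         /* movq rax, xmm0 */

    unsigned exponent = (unsigned)((u + u) >> 53);    /* add; shr */
    if (exponent >= (unsigned)(BIAS + 53))            /* NaN, infinity or */
        return x;                                     /* |x| >= 0x1.0p53 */

    int64_t i = (int64_t)x;                           /* cvttsd2si */
    double t = (double)i;                             /* cvtsi2sd */

    uint64_t v;
    memcpy(&v, &t, sizeof v);
    v |= u & UINT64_C(0x8000000000000000);            /* psrlq/psllq; orpd */
    memcpy(&t, &v, sizeof t);
    return t;
}

/* A few spot checks, including the -0.0 corner case. */
static void check_trunc_model(void)
{
    assert(trunc_model(2.7) == 2.0);
    assert(trunc_model(-2.7) == -2.0);
    assert(trunc_model(-0.5) == 0.0 && signbit(trunc_model(-0.5)));
    assert(isnan(trunc_model(NAN)));
}
```

The final OR only matters for arguments in (-1.0, -0.0], where the round
trip through the integer loses the sign.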

@Richard Biener (et al.):

1. Is a primitive for "floating-point > 2**x", which generates such
   an "integer" code sequence, already available, at least for
   float/binary32 and double/binary64?

2. The procedural code generator for __builtin_trunc() etc. uses
   __builtin_fabs() and __builtin_copysign() as building blocks.
   These would need to (and of course should) be modified to generate
   psllq/psrlq pairs instead of andpd/andnpd referencing a memory
   location with either -0.0 or ~(-0.0).
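
To illustrate point 2, here is a C sketch of fabs and copysign built from
shift pairs on the bit pattern instead of masks loaded from memory. The
function names are hypothetical, double is assumed to be IEEE 754
binary64, and whether GCC should generate this is exactly the open
question above.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

static double from_bits(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* "psllq xmm, 1; psrlq xmm, 1": clear the sign bit. */
static double fabs_by_shifts(double x)
{
    return from_bits((bits_of(x) << 1) >> 1);
}

/* "psrlq xmm, 63; psllq xmm, 63": keep only the sign bit (+0.0 or -0.0). */
static double sign_by_shifts(double x)
{
    return from_bits((bits_of(x) >> 63) << 63);
}

/* OR the two pieces together, as orpd does. */
static double copysign_by_shifts(double magnitude, double sign)
{
    return from_bits(bits_of(fabs_by_shifts(magnitude))
                     | bits_of(sign_by_shifts(sign)));
}

static void check_shift_helpers(void)
{
    assert(fabs_by_shifts(-3.5) == 3.5);
    assert(sign_by_shifts(-3.5) == 0.0 && signbit(sign_by_shifts(-3.5)));
    assert(copysign_by_shifts(2.0, -0.0) == -2.0);
}
```

Neither helper needs a floating-point constant, which is the point of
the shift pairs.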

For -ffast-math, where the sign of -0.0 is not handled and the spurious
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:

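.code64
.intel_syntax noprefix
.equ    BIAS, 1023
.text
        cvttsd2si rax, xmm0  # rax = trunc(argument)
        cmp     rax, 1       # cvttsd2si leaves the flags unchanged; OF is
                             # set only for rax = 0x8000000000000000
        jo      .Lexit       # argument indefinite?
                             # |argument| >= 0x1.0p63?
        cvtsi2sd xmm0, rax   # xmm0 = trunc(argument)
.Lexit: ret
.end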

[...]

>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>> 
>>                                         .code64
>>                                         .intel_syntax noprefix
>>                                         .text
>>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
>>    a:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = 0x1.0p52 = 4503599627370496.0
>>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
>>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0       # xmm1 = argument
>>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
>>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1          # xmm0 = |argument|
>>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
>>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63         # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
>>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2       # xmm0 = |argument| + 0x1.0p52
>>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = |argument| - 0x1.0p52
>>                                                                  #      = rint(|argument|)
>>   3d:   66 48 0f 6e d0                  movq    xmm2, rax        # xmm2 = -0x1.0p0 = -1.0
> 
> Huh? I see +1.0, -1 would be 0xBFF0000000000000.

Spurious error in the comment.
I modified code which uses -1.0 and performs (a commutative) "addsd xmm2, xmm2"
instead of "subsd xmm0, xmm2" to save a "movsd" instruction.

>>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0       # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>>   47:   66 0f 54 d3                     andpd   xmm2, xmm3       # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2       # xmm0 = rint(|argument|)
>>                                                                  #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>                                                                  #      = trunc(|argument|)
>>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1       # xmm0 = trunc(argument)
>>   53:   c3                              ret
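
The idea behind the quoted branchless sequence can be modelled in C as
follows. This is a sketch under assumptions: the function names are
hypothetical, double is IEEE 754 binary64 evaluated in double precision,
the rounding mode is round-to-nearest, and only |x| < 0x1.0p52 is
considered; larger magnitudes are not examined here.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

static uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

static double from_bits(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Hypothetical model: rint via add/subtract of 2^52, then correct
   downwards by 1.0 where rint rounded up. Assumes |x| < 0x1.0p52. */
static double trunc_branchless_model(double x)
{
    const double two52 = 4503599627370496.0;        /* 0x1.0p52 */
    uint64_t u = bits_of(x);
    uint64_t sign = (u >> 63) << 63;                 /* psrlq 63; psllq 63 */
    double a = from_bits((u << 1) >> 1);             /* psllq 1; psrlq 1 */

    /* volatile keeps the compiler from folding (a + two52) - two52. */
    volatile double sum = a + two52;                 /* addsd */
    double r = sum - two52;                          /* subsd: rint(|x|) */

    double adjust = (a < r) ? 1.0 : 0.0;             /* cmpltsd; andpd */
    double t = r - adjust;                           /* subsd: trunc(|x|) */
    return from_bits(bits_of(t) | sign);             /* orpd */
}

static void check_trunc_branchless_model(void)
{
    assert(trunc_branchless_model(2.7) == 2.0);
    assert(trunc_branchless_model(3.5) == 3.0);
    assert(trunc_branchless_model(-2.7) == -2.0);
    assert(signbit(trunc_branchless_model(-0.5)));
}
```

The comparison is needed because adding and subtracting 0x1.0p52 rounds
to nearest, so the intermediate result can exceed |x| by up to 0.5.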

regards
Stefan
