Gabriel Paubert <[email protected]> wrote:
> On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> Hi,
>>
>> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the
>> following code (13 instructions using 57 bytes, plus 4 quadwords
>> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given:
>>
>> .text
>> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2
>> 4: R_X86_64_PC32 .rdata
>> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4
>> c: R_X86_64_PC32 .rdata
>> 10: 66 0f 28 d8 movapd %xmm0, %xmm3
>> 14: 66 0f 28 c8 movapd %xmm0, %xmm1
>> 18: 66 0f 54 da andpd %xmm2, %xmm3
>> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4
>> 20: 76 16 jbe 38 <_trunc+0x38>
>> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax
>> 27: 66 0f ef c0 pxor %xmm0, %xmm0
>> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2
>> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0
>> 34: 66 0f 56 c2 orpd %xmm2, %xmm0
>> 38: c3 retq
>>
>>          .rdata
>>          .align 8
>>  0:  00 00 00 00 00 00 30 43   .LC0:  .quad  0x1.0p52
>>  8:  00 00 00 00 00 00 00 00
>>          .align 16
>> 10:  ff ff ff ff ff ff ff 7f   .LC1:  .quad  ~(-0.0)
>> 18:  00 00 00 00 00 00 00 00          .quad  0.0
>>          .end
>>
>> JFTR: in the best case, the memory accesses cost several cycles,
>> while in the worst case they yield a page fault!
>>
>>
>> Properly optimized, shorter and faster code, using only 9 instructions
>> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory
>> accesses and saving at least 16 + 32 bytes, follows:
>>
>> .intel_syntax
>> .text
>>  0: f2 48 0f 2c c0    cvttsd2si rax, xmm0    # rax = trunc(argument)
>>  5: 48 f7 d8          neg       rax
>> #                     jz        .L0          # argument zero?
>>  8: 70 16             jo        .L0          # argument indefinite?
>>                                              # argument overflows 64-bit integer?
>>  a: 48 f7 d8          neg       rax
>>  d: f2 48 0f 2a c8    cvtsi2sd  xmm1, rax    # xmm1 = trunc(argument)
>> 12: 66 0f 73 d0 3f    psrlq     xmm0, 63
>> 17: 66 0f 73 f0 3f    psllq     xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> 1c: 66 0f 56 c1       orpd      xmm0, xmm1   # xmm0 = trunc(argument)
>> 20: c3          .L0:  ret
>>     .end
>
> There is one important difference, namely setting the invalid exception
> flag when the parameter can't be represented in a signed integer.
Right, I overlooked this; thanks for pointing it out.
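For reference, the sequence above corresponds roughly to the following C.
This is only a sketch (the function name is mine) that leans on the x86-64
behaviour of the conversion: for NaN and |x| >= 2^63, cvttsd2si raises the
invalid exception and returns INT64_MIN, the "integer indefinite" value.

#include <stdint.h>
#include <string.h>

static double trunc_sketch(double x)
{
    int64_t i = (int64_t)x;        /* cvttsd2si: truncate toward zero; NaN and
                                      out-of-range inputs raise the invalid
                                      exception and yield INT64_MIN            */
    if (i == INT64_MIN)            /* neg/jo: "integer indefinite"?            */
        return x;                  /* NaN, Inf, |x| >= 2^63: return unchanged  */

    uint64_t bits, sign;
    memcpy(&bits, &x, sizeof bits);
    sign = bits & 0x8000000000000000ull;  /* psrlq/psllq 63: isolate sign bit  */

    double t = (double)i;                 /* cvtsi2sd                          */
    memcpy(&bits, &t, sizeof bits);
    bits |= sign;                         /* orpd: restores -0.0 for negative
                                             inputs that truncate to zero      */
    memcpy(&t, &bits, sizeof bits);
    return t;
}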
> So using your code may require some option (-fast-math comes to mind),
> or you need at least a check on the exponent before cvttsd2si.
The whole idea behind these implementations is to get rid of loading
floating-point constants just to perform comparisons.
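If a guard is added, it can at least stay in the integer domain instead of
loading a floating-point constant. A sketch (the name is mine), where the
0x433 threshold is the biased exponent of the 0x1.0p52 constant GCC
compares against:

#include <stdint.h>
#include <string.h>

/* Biased exponent >= 0x433 (1075) means |x| >= 2^52, i.e. x is already
 * integral, or x is NaN/Inf (exponent 0x7ff); none of these need to reach
 * cvttsd2si, so the invalid exception can no longer be raised.            */
static int needs_conversion(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    return ((bits >> 52) & 0x7ff) < 0x433;
}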
> The last part of your code then goes to take into account the special
> case of -0.0, which I most often don't care about (I'd like to have a
> -fdont-split-hairs-about-the-sign-of-zero option).
Preserving the sign of -0.0 is explicitly specified in the standard,
and is cheap, as shown in my code.
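A conforming trunc() has to satisfy, for instance, the following; the
negative fractional cases are exactly where the orpd with the isolated
sign bit matters:

#include <assert.h>
#include <math.h>

int main(void)
{
    /* C Annex F / IEEE 754: truncation preserves the sign, so -0.0 and
     * every value in (-1.0, 0.0) truncate to -0.0.                      */
    assert(signbit(trunc(-0.0)));
    assert(signbit(trunc(-0.25)) && trunc(-0.25) == 0.0);
    assert(!signbit(trunc(+0.25)) && trunc(+0.25) == 0.0);
    return 0;
}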
> Potentially generating spurious invalid operation and then carefully
> taking into account the sign of zero does not seem very consistent.
>
> Apart from this, in your code, after cvttsd2si I'd rather use:
> mov rcx,rax # make a second copy to a scratch register
> neg rcx
> jo .L0
> cvtsi2sd xmm1,rax
I don't know how GCC generates the code for builtins or what kind of
templates it uses; my second goal was to minimize register usage.
> The reason is latency, in an OoO engine, splitting the two paths is
> almost always a win.
>
> With your patch:
>
> cvttsd2si-->neg-?->neg-->cvtsi2sd
>
> where the ? means that the following instructions are speculated.
>
> With an auxiliary register there are two dependency chains:
>
> cvttsd2si-?->cvtsi2sd
> |->mov->neg->jump
Correct; see above: I expect the template(s) for builtins to give
the register allocator some freedom to split code paths and resolve
dependency chains.
> Actually some OoO cores just eliminate register copies using register
> renaming mechanism. But even this is probably completely irrelevant in
> this case where the latency is dominated by the two conversion
> instructions.
Right, the conversions dominate both the original and the code I posted.
It's easy to get rid of them; the resulting branchless code is still
slightly shorter and faster (17 instructions in 84 bytes, instead of
13 instructions plus constants totalling 57 + 32 = 89 bytes):
.code64
.intel_syntax noprefix
.text
 0: 48 b8 00 00 00 00 00 00 30 43   mov       rax, 0x4330000000000000
 a: 66 48 0f 6e d0                  movq      xmm2, rax    # xmm2 = 0x1.0p52 = 4503599627370496.0
 f: 48 b8 00 00 00 00 00 00 f0 3f   mov       rax, 0x3FF0000000000000
19: f2 0f 10 c8                     movsd     xmm1, xmm0   # xmm1 = argument
1d: 66 0f 73 f0 01                  psllq     xmm0, 1
22: 66 0f 73 d0 01                  psrlq     xmm0, 1      # xmm0 = |argument|
27: 66 0f 73 d1 3f                  psrlq     xmm1, 63
2c: 66 0f 73 f1 3f                  psllq     xmm1, 63     # xmm1 = (argument & -0.0) ? -0.0 : +0.0
31: f2 0f 10 d8                     movsd     xmm3, xmm0
35: f2 0f 58 c2                     addsd     xmm0, xmm2   # xmm0 = |argument| + 0x1.0p52
39: f2 0f 5c c2                     subsd     xmm0, xmm2   # xmm0 = |argument| + 0x1.0p52 - 0x1.0p52
                                                           #      = rint(|argument|)
3d: 66 48 0f 6e d0                  movq      xmm2, rax    # xmm2 = 0x1.0p0 = 1.0
42: f2 0f c2 d8 01                  cmpltsd   xmm3, xmm0   # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
47: 66 0f 54 d3                     andpd     xmm2, xmm3   # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
4b: f2 0f 5c c2                     subsd     xmm0, xmm2   # xmm0 = rint(|argument|)
                                                           #      - ((|argument| < rint(|argument|)) ? 1.0 : 0.0)
                                                           #      = trunc(|argument|)
4f: 66 0f 56 c1                     orpd      xmm0, xmm1   # xmm0 = trunc(argument)
53: c3                              ret
.end
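For readers who prefer C over assembler, the same add/subtract-0x1.0p52
technique reads roughly as follows. This is only a sketch: the early return
for |argument| >= 0x1.0p52 (such values are already integral) and the names
are mine, the branch stands in for the cmpltsd/andpd mask for readability,
and the function must not be compiled with -ffast-math, which may fold the
add/subtract pair away.

#include <stdint.h>
#include <string.h>

static double trunc_branchless_sketch(double x)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    uint64_t sign = bits & 0x8000000000000000ull;   /* psrlq/psllq 63          */

    double ax;                                      /* |argument|              */
    bits &= 0x7fffffffffffffffull;                  /* psllq/psrlq 1           */
    memcpy(&ax, &bits, sizeof ax);

    if (!(ax < 0x1p52))             /* already integral, NaN or Inf: the       */
        return x;                   /* 0x1.0p52 trick is only exact below 2^52 */

    double r = (ax + 0x1p52) - 0x1p52;              /* r = rint(|argument|)    */
    if (ax < r)                                     /* rounded up?             */
        r -= 1.0;                                   /* then step down to trunc */

    memcpy(&bits, &r, sizeof bits);
    bits |= sign;                                   /* orpd: restore the sign  */
    memcpy(&r, &bits, sizeof bits);
    return r;
}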
regards
Stefan