Hi, On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote: > Gabriel Paubert <paub...@iram.es> wrote: > > > > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote: > >> Hi, > >> > >> targeting AMD64 alias x86_64 with -O3, GCC 10.2.0 generates the > >> following code (13 instructions using 57 bytes, plus 4 quadwords > >> using 32 bytes) for __builtin_trunc() when -msse4.1 is NOT given: > >> > >> .text > >> 0: f2 0f 10 15 10 00 00 00 movsd .LC1(%rip), %xmm2 > >> 4: R_X86_64_PC32 .rdata > >> 8: f2 0f 10 25 00 00 00 00 movsd .LC0(%rip), %xmm4 > >> c: R_X86_64_PC32 .rdata > >> 10: 66 0f 28 d8 movapd %xmm0, %xmm3 > >> 14: 66 0f 28 c8 movapd %xmm0, %xmm1 > >> 18: 66 0f 54 da andpd %xmm2, %xmm3 > >> 1c: 66 0f 2e e3 ucomisd %xmm3, %xmm4 > >> 20: 76 16 jbe 38 <_trunc+0x38> > >> 22: f2 48 0f 2c c0 cvttsd2si %xmm0, %rax > >> 27: 66 0f ef c0 pxor %xmm0, %xmm0 > >> 2b: 66 0f 55 d1 andnpd %xmm1, %xmm2 > >> 2f: f2 48 0f 2a c0 cvtsi2sd %rax, %xmm0 > >> 34: 66 0f 56 c2 orpd %xmm2, %xmm0 > >> 38: c3 retq > >> > >> .rdata > >> .align 8 > >> 0: 00 00 00 00 .LC0: .quad 0x1.0p52 > >> 00 00 30 43 > >> 00 00 00 00 > >> 00 00 00 00 > >> .align 16 > >> 10: ff ff ff ff .LC1: .quad ~(-0.0) > >> ff ff ff 7f > >> 18: 00 00 00 00 .quad 0.0 > >> 00 00 00 00 > >> .end > >> > >> JFTR: in the best case, the memory accesses cost several cycles, > >> while in the worst case they yield a page fault! > >> > >> > >> Properly optimized, shorter and faster code, using but only 9 instructions > >> in just 33 bytes, WITHOUT any constants, thus avoiding costly memory > >> accesses > >> and saving at least 16 + 32 bytes, follows: > >> > >> .intel_syntax > >> .text > >> 0: f2 48 0f 2c c0 cvttsd2si rax, xmm0 # rax = trunc(argument) > >> 5: 48 f7 d8 neg rax > >> # jz .L0 # argument zero? > >> 8: 70 16 jo .L0 # argument indefinite? > >> # argument overflows > >> 64-bit integer? > >> a: 48 f7 d8 neg rax > >> d: f2 48 0f 2a c8 cvtsi2sd xmm1, rax # xmm1 = trunc(argument) > >> 12: 66 0f 73 d0 3f psrlq xmm0, 63 > >> 17: 66 0f 73 f0 3f psllq xmm0, 63 # xmm0 = (argument & > >> -0.0) ? -0.0 : 0.0 > >> 1c: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = trunc(argument) > >> 20: c3 .L0: ret > >> .end > > > > There is one important difference, namely setting the invalid exception > > flag when the parameter can't be represented in a signed integer. > > Right, I overlooked this fault. Thanks for pointing out. > > > So using your code may require some option (-fast-math comes to mind), > > or you need at least a check on the exponent before cvttsd2si. > > The whole idea behind these implementations is to get rid of loading > floating-point constants to perform comparisions.
Indeed, but what I had in mind was something along the following lines: movq rax,xmm0 # and copy rax to say rcx, if needed later shrq rax,52 # move sign and exponent to 12 LSBs andl eax,0x7ff # mask the sign cmpl eax,0x434 # value to be checked ja return # exponent too large, we're done (what about NaNs?) cvttsd2si rax,xmm0 # safe after exponent check cvtsi2sd xmm0,rax # conversion done and a bit more to handle the corner cases (essentially preserve the sign to be correct between -1 and -0.0). But the CPU can (speculatively) start the conversions early, so the dependency chain is rather short. I don't know if it's faster than your new code, I'm almost sure that it's shorter. Your new code also has a fairly long dependency chain. > > > The last part of your code then goes to take into account the special > > case of -0.0, which I most often don't care about (I'd like to have a > > -fdont-split-hairs-about-the-sign-of-zero option). > > Preserving the sign of -0.0 is explicitly specified in the standard, > and is cheap, as shown in my code. > > > Potentially generating spurious invalid operation and then carefully > > taking into account the sign of zero does not seem very consistent. > > > > Apart from this, in your code, after cvttsd2si I'd rather use: > > mov rcx,rax # make a second copy to a scratch register > > neg rcx > > jo .L0 > > cvtsi2sd xmm1,rax > > I don't know how GCC generates the code for builtins, and what kind of > templates it uses: the second goal was to minimize register usage. > Ok, but on 64 bit using two GPRs would still be reasonable. > > The reason is latency, in an OoO engine, splitting the two paths is > > almost always a win. > > > > With your patch: > > > > cvttsd2si-->neg-?->neg-->cvtsi2sd > > > > where the ? means that the following instructions are speculated. > > > > With an auxiliary register there are two dependency chains: > > > > cvttsd2si-?->cvtsi2sd > > |->mov->neg->jump > > Correct; see above: I expect the template(s) for builtins to give > the register allocator some freedom to split code paths and resolve > dependency chains. > > > Actually some OoO cores just eliminate register copies using register > > renaming mechanism. But even this is probably completely irrelevant in > > this case where the latency is dominated by the two conversion > > instructions. > > Right, the conversions dominate both the original and the code I posted. > It's easy to get rid of them, with still slightly shorter and faster > branchless code (17 instructions, 84 bytes, instead of 13 instructions, > 57 + 32 = 89 bytes): > > .code64 > .intel_syntax noprefix > .text > 0: 48 b8 00 00 00 00 00 00 30 43 mov rax, 0x4330000000000000 > a: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = > 0x1.0p52 = 4503599627370496.0 > f: 48 b8 00 00 00 00 00 00 f0 3f mov rax, 0x3FF0000000000000 > 19: f2 0f 10 c8 movsd xmm1, xmm0 # xmm1 = > argument > 1d: 66 0f 73 f0 01 psllq xmm0, 1 > 22: 66 0f 73 d0 01 psrlq xmm0, 1 # xmm0 = > |argument| > 27: 66 0f 73 d1 3f psrlq xmm1, 63 > 2c: 66 0f 73 f1 3f psllq xmm1, 63 # xmm1 = > (argument & -0.0) ? -0.0 : +0.0 > 31: f2 0f 10 d8 movsd xmm3, xmm0 > 35: f2 0f 58 c2 addsd xmm0, xmm2 # xmm0 = > |argument| + 0x1.0p52 > 39: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = > |argument| - 0x1.0p52 > # = > rint(|argument|) > 3d: 66 48 0f 6e d0 movq xmm2, rax # xmm2 = > -0x1.0p0 = -1.0 Huh? I see +1.0, -1 would be 0xBFF0000000000000. > 42: f2 0f c2 d8 01 cmpltsd xmm3, xmm0 # xmm3 = > (|argument| < rint(|argument|)) ? ~0L : 0L > 47: 66 0f 54 d3 andpd xmm2, xmm3 # xmm2 = > (|argument| < rint(|argument|)) ? 1.0 : 0.0 > 4b: f2 0f 5c c2 subsd xmm0, xmm2 # xmm0 = > rint(|argument|) > # - > (|argument| < rint(|argument|)) ? 1.0 : 0.0 > # = > trunc(|argument|) > 4f: 66 0f 56 c1 orpd xmm0, xmm1 # xmm0 = > trunc(argument) > 53: c3 ret > .end > > regards > Stefan Regards, Gabriel