Gabriel Paubert <paub...@iram.es> wrote:

> Hi,
>
> On Thu, Aug 05, 2021 at 01:58:12PM +0200, Stefan Kanthak wrote:
>> Gabriel Paubert <paub...@iram.es> wrote:
>>
>>
>> > On Thu, Aug 05, 2021 at 09:25:02AM +0200, Stefan Kanthak wrote:
>> >>         .intel_syntax
>> >>         .text
>> >>    0:   f2 48 0f 2c c0        cvttsd2si rax, xmm0  # rax = trunc(argument)
>> >>    5:   48 f7 d8              neg     rax
>> >>                          #    jz      .L0          # argument zero?
>> >>    8:   70 16                 jo      .L0          # argument indefinite?
>> >>                                                    # argument overflows 64-bit integer?
>> >>    a:   48 f7 d8              neg     rax
>> >>    d:   f2 48 0f 2a c8        cvtsi2sd xmm1, rax   # xmm1 = trunc(argument)
>> >>   12:   66 0f 73 d0 3f        psrlq   xmm0, 63
>> >>   17:   66 0f 73 f0 3f        psllq   xmm0, 63     # xmm0 = (argument & -0.0) ? -0.0 : 0.0
>> >>   1c:   66 0f 56 c1           orpd    xmm0, xmm1   # xmm0 = trunc(argument)
>> >>   20:   c3              .L0:  ret
>> >>         .end
>> >
>> > There is one important difference, namely setting the invalid exception
>> > flag when the parameter can't be represented in a signed integer.
>>
>> Right, I overlooked this fault. Thanks for pointing it out.
>>
>> > So using your code may require some option (-fast-math comes to mind),
>> > or you need at least a check on the exponent before cvttsd2si.
>>
>> The whole idea behind these implementations is to get rid of loading
>> floating-point constants to perform comparisons.
>
> Indeed, but what I had in mind was something along the following lines:
>
>       movq rax,xmm0      # and copy rax to say rcx, if needed later
>       shrq rax,52        # move sign and exponent to 12 LSBs
>       andl eax,0x7ff     # mask the sign
>       cmpl eax,0x434     # value to be checked
>       ja return          # exponent too large, we're done (what about NaNs?)
>       cvttsd2si rax,xmm0 # safe after exponent check
>       cvtsi2sd xmm0,rax  # conversion done
>
> and a bit more to handle the corner cases (essentially preserve the
> sign to be correct between -1 and -0.0).

The sign of -0.0 is the only corner case and is already handled in my code.
Both SNaN and QNaN (which have an exponent of 0x7ff) are handled and
preserved, both in the code GCC generates and in my code.

> But the CPU can (speculatively) start the conversions early, so the
> dependency chain is rather short.

Correct.

> I don't know if it's faster than your new code,

It should be faster.

> I'm almost sure that it's shorter.

"neg rax; jo ...; neg rax" is 3+2+3=8 bytes, whereas the above sequence
takes 5+4+5+5+2=21 bytes.

JFTR: better use "add rax,rax; shr rax,53" instead of
      "shr rax,52; and eax,0x7ff" and save 2 bytes.

Complete, properly optimized code for __builtin_trunc is then as follows
(11 instructions, 44 bytes):

        .code64
        .intel_syntax noprefix
        .equ    BIAS, 1023
        .text
        movq    rax, xmm0       # rax = argument
        add     rax, rax
        shr     rax, 53         # rax = exponent of |argument|
        cmp     eax, BIAS + 53
        jae     .Lexit          # argument indefinite?
                                # |argument| >= 0x1.0p53?
        cvttsd2si rax, xmm0     # rax = trunc(argument)
        cvtsi2sd xmm1, rax      # xmm1 = trunc(argument)
        psrlq   xmm0, 63
        psllq   xmm0, 63        # xmm0 = (argument & -0.0) ? -0.0 : 0.0
        orpd    xmm0, xmm1      # xmm0 = trunc(argument)
.Lexit: ret
        .end

@Richard Biener (et al.):

1. Is a primitive for "floating-point > 2**x", which generates such
   an "integer" code sequence, already available, at least for
   float/binary32 and double/binary64?

2. The procedural code generator for __builtin_trunc() etc. uses
   __builtin_fabs() and __builtin_copysign() as building blocks.
   These would need to (and of course should) be modified to generate
   psllq/psrlq pairs instead of andpd/andnpd referencing a memory
   location with either -0.0 or ~(-0.0).

For -ffast-math, where the sign of -0.0 is not handled and the spurious
invalid floating-point exception for |argument| >= 2**63 is acceptable,
it boils down to:

        .code64
        .intel_syntax noprefix
        .equ    BIAS, 1023
        .text
        cvttsd2si rax, xmm0     # rax = trunc(argument)
        jo      .Lexit          # argument indefinite?
                                # |argument| > 0x1.0p63?
        cvtsi2sd xmm0, rax      # xmm0 = trunc(argument)
.Lexit: ret
        .end

[...]

>> Right, the conversions dominate both the original and the code I posted.
>> It's easy to get rid of them, with still slightly shorter and faster
>> branchless code (17 instructions, 84 bytes, instead of 13 instructions,
>> 57 + 32 = 89 bytes):
>>
>>         .code64
>>         .intel_syntax noprefix
>>         .text
>>    0:   48 b8 00 00 00 00 00 00 30 43   mov     rax, 0x4330000000000000
>>    a:   66 48 0f 6e d0                  movq    xmm2, rax       # xmm2 = 0x1.0p52 = 4503599627370496.0
>>    f:   48 b8 00 00 00 00 00 00 f0 3f   mov     rax, 0x3FF0000000000000
>>   19:   f2 0f 10 c8                     movsd   xmm1, xmm0      # xmm1 = argument
>>   1d:   66 0f 73 f0 01                  psllq   xmm0, 1
>>   22:   66 0f 73 d0 01                  psrlq   xmm0, 1         # xmm0 = |argument|
>>   27:   66 0f 73 d1 3f                  psrlq   xmm1, 63
>>   2c:   66 0f 73 f1 3f                  psllq   xmm1, 63        # xmm1 = (argument & -0.0) ? -0.0 : +0.0
>>   31:   f2 0f 10 d8                     movsd   xmm3, xmm0
>>   35:   f2 0f 58 c2                     addsd   xmm0, xmm2      # xmm0 = |argument| + 0x1.0p52
>>   39:   f2 0f 5c c2                     subsd   xmm0, xmm2      # xmm0 = |argument| - 0x1.0p52
>>                                                                 #      = rint(|argument|)
>>   3d:   66 48 0f 6e d0                  movq    xmm2, rax       # xmm2 = -0x1.0p0 = -1.0
>
> Huh? I see +1.0, -1 would be 0xBFF0000000000000.

That is a leftover error in the comment. I adapted this from code which
uses -1.0 and performs (a commutative) "addsd xmm2, xmm2" instead of
"subsd xmm0, xmm2" to save a "movsd" instruction.

>>   42:   f2 0f c2 d8 01                  cmpltsd xmm3, xmm0      # xmm3 = (|argument| < rint(|argument|)) ? ~0L : 0L
>>   47:   66 0f 54 d3                     andpd   xmm2, xmm3      # xmm2 = (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>   4b:   f2 0f 5c c2                     subsd   xmm0, xmm2      # xmm0 = rint(|argument|)
>>                                                                 #      - (|argument| < rint(|argument|)) ? 1.0 : 0.0
>>                                                                 #      = trunc(|argument|)
>>   4f:   66 0f 56 c1                     orpd    xmm0, xmm1      # xmm0 = trunc(argument)
>>   53:   c3                              ret

regards
Stefan
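
For readers who find C easier to follow than the listings, the
exponent-check variant (the 11-instruction sequence) can be sketched
roughly as below. This is an illustration only, not code from GCC or
from this thread: the names bits_of, double_of, trunc_via_int and
trunc_via_int_selfcheck are made up for the example, and it assumes
IEEE 754 binary64 doubles and a 64-bit two's complement int64_t.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BIAS 1023   /* exponent bias of binary64, as in the listing */

/* Hypothetical helpers: move the raw bits between double and integer. */
static inline uint64_t bits_of(double x)
{
    uint64_t u;
    memcpy(&u, &x, sizeof u);
    return u;
}

static inline double double_of(uint64_t u)
{
    double x;
    memcpy(&x, &u, sizeof x);
    return x;
}

/* Sketch of trunc() via an integer round trip (assumed name). */
double trunc_via_int(double x)
{
    uint64_t u = bits_of(x);

    /* "add rax,rax; shr rax,53": drop the sign, keep the 11 exponent bits. */
    unsigned exponent = (unsigned)((u << 1) >> 53);

    /* |x| >= 2**53, infinity or NaN: already integral (or not a number),
       so return the argument unchanged and never convert it. */
    if (exponent >= BIAS + 53)
        return x;

    /* "cvttsd2si; cvtsi2sd": safe here, since |x| < 2**53 fits in int64_t. */
    double t = (double)(int64_t)x;

    /* "psrlq 63; psllq 63; orpd": re-apply the sign so that arguments
       in (-1.0, -0.0] yield -0.0 instead of +0.0. */
    uint64_t sign = (u >> 63) << 63;
    return double_of(bits_of(t) | sign);
}

/* Small usage example (assumed name); aborts if an expectation fails. */
void trunc_via_int_selfcheck(void)
{
    assert(trunc_via_int(2.75) == 2.0);
    assert(trunc_via_int(-2.75) == -2.0);
    assert(bits_of(trunc_via_int(-0.5)) == UINT64_C(0x8000000000000000));
    assert(trunc_via_int(0x1.0p60) == 0x1.0p60);
}
```

Whether a compiler turns this into the same instructions as the
hand-written listing depends on the target and on the options used;
the sketch only mirrors the logic step by step.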