https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
--- Comment #4 from Peter Cordes <peter at cordes dot ca> ---
(In reply to Jakub Jelinek from comment #0)
> At least on i7-5960X in the following testcase:
>
> baz is fastest as well as shortest.
> So I think we should consider using movl $cst, %edx; shlq $shift, %rdx
> instead of movabsq $(cst << shift), %rdx.
>
> Unfortunately I can't find in Agner Fog MOVABS and for MOV r64,i64 there is
> too little information, so it is unclear on which CPUs it is beneficial.

Agner uses Intel syntax, where imm64 doesn't have a special mnemonic.  It's
part of the mov r,i entry in the tables.  But those tables give throughput
for a flat sequence of the instruction repeated many times, not mixed with
other instructions where front-end effects can be different.

Agner probably didn't actually test mov r64,imm64, because its throughput is
different when tested in a long sequence (not in a small loop).  According to
http://users.atw.hu/instlatx64/GenuineIntel00506E3_Skylake2_InstLatX64.txt,
a regular desktop Skylake has 0.64c throughput for mov r64,imm64, vs. 0.25c
for mov r32,imm32.  (They don't test mov r/m64,imm32, the 7-byte encoding
for something like mov rax,-1.)

Skylake with up-to-date microcode (including all SKX CPUs) disables the loop
buffer (LSD), and has to read uops from the uop cache every time, even in
short loops.  Uop-cache effects could be a problem for instructions with a
64-bit immediate.  Agner only did detailed testing for Sandybridge; it's
likely that Skylake still mostly works the same (although the uop-cache read
bandwidth is higher).

mov r64,imm64 takes 2 entries in the uop cache (because the 64-bit immediate
is outside the signed 32-bit range), and takes 2 cycles to read from the uop
cache, according to Agner's Table 9.1 in his microarch pdf.  It can borrow
space from another entry in the same uop-cache line, but it still takes
extra cycles to read.

See
https://stackoverflow.com/questions/46433208/which-is-faster-imm64-or-m64-for-x86-64
for an SO question the other day about loading constants from memory vs.
imm64.  (Although I didn't have anything very wise to say there, just that
it depends on surrounding code, as always!)

> Peter, any information on what the MOV r64,i64 latency/throughput on various
> CPUs vs. MOV r32,i32; SHL r64,i8 is?

When not bottlenecked on the front-end, mov r64,imm64 is a single ALU uop
with 1c latency.  I think it's pretty much universal that it's the best
choice when you bottleneck on anything else.

Some loops *do* bottleneck on the front-end, though, especially without
unrolling.  But then it comes down to whether we have a uop-cache read
bottleneck, a decode bottleneck, or an issue bottleneck (4 fused-domain uops
per clock renamed/issued).  For issue/retire bandwidth, mov/shl is 2 uops
instead of 1.  But for code that bottlenecks on reading the uop cache, it's
really hard to say which is better in general.  I think if the imm64 can
borrow space from other entries in the cache line, it's better for uop-cache
density than mov/shl.  Unless the extra code size means one fewer
instruction fits into a uop-cache line that wasn't nearly full (6 uops).
Front-end stuff is *very* context-sensitive. :/

Calling a very short non-inline function from a tiny loop is probably making
the uop-cache issues worse, and is probably favouring the mov/shift approach
over mov r64,imm64 more than you'd see as part of a larger contiguous block.
I *think* mov r64,imm64 should still generally be preferred in most cases.
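For concreteness, here's a quick sketch of the two sequences under
discussion (AT&T syntax; the constant is just an illustrative example, and
the byte counts are hand-counted, so double-check them against an
assembler):

        # one instruction: 10 bytes, 1 fused-domain uop,
        # but 2 uop-cache entries because of the imm64
        movabsq $0x1234500000, %rdx

        # two instructions: 5 + 4 = 9 bytes, 2 fused-domain uops,
        # each a normal single-slot entry in the uop cache
        movl    $0x12345, %edx
        shlq    $20, %rdx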
Usually the issue queue (IDQ) between the uop cache and the issue/rename
stage can absorb uop-cache read bubbles.  A constant pool might be worth
considering if code size is getting huge (average instruction length much
greater than 4).  Normally, of course, you'd really want to hoist an imm64
out of a loop if you have a spare register.

When optimizing small loops, you can usually avoid front-end bottlenecks.
It's a lot harder for medium-sized loops involving separate functions.  I'm
not confident this noinline case is very representative of real code.

-------

Note that in this special case, you can save another byte of code by using
ror rax (the implicit by-one encoding).

Also worth considering for tune=sandybridge or later: xor eax,eax /
bts rax,63.  2B + 5B = 7B.  BTS has 0.5c throughput, and xor-zeroing doesn't
need an ALU on SnB-family (so it has zero latency; the BTS can execute right
away even if it issues in the same cycle as the xor-zeroing).

BTS runs on the same ports as shifts (p0/p6 in HSW+, or p0/p5 in SnB/IvB).
On older Intel, it has 1 per clock throughput for the reg,imm form.  On AMD,
it's 2 uops with 1c throughput (0.5c on Ryzen), so it's not bad if used on
AMD CPUs, but it doesn't look good for tune=generic.

At -Os, you could consider or eax,-1; shl rax,63.  (Also 7 bytes, and it
works for constants with multiple consecutive high bits set.)  The false
dependency on the old RAX value is often not a bottleneck, and gcc already
uses OR with -1 for return -1;

It's too bad there isn't an efficient 3-byte way to get small constants
zero-extended into registers, like a mov r/m32, imm8 or something.  That
would make the code-size savings large enough to be worth considering
multi-instruction stuff more often.
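For the 1<<63 special case specifically, the alternatives above stack up
roughly like this (AT&T syntax; hand-counted sizes, so treat this as a
sketch rather than gospel):

        movabsq $0x8000000000000000, %rax  # 10 bytes, 1 uop (2 uop-cache slots)

        movl    $1, %eax                   # 5 bytes
        shlq    $63, %rax                  # 4 bytes: 9 bytes, 2 uops

        movl    $1, %eax                   # 5 bytes
        rorq    %rax                       # 3 bytes (implicit by-one): 8 bytes total

        xorl    %eax, %eax                 # 2 bytes, zeroing idiom
        btsq    $63, %rax                  # 5 bytes: 7 bytes total

        orl     $-1, %eax                  # 3 bytes (reads the old EAX value)
        shlq    $63, %rax                  # 4 bytes: 7 bytes total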