[Bug c++/35669] NULL (__null) not considered different from 0 with C++
--- Comment #8 from peter at cordes dot ca 2009-02-12 17:56 ---
Would it cause any problems for g++ to behave more like a C compiler when it comes to NULL? e.g. I found this bug report after finding that kscope 1.9.1 didn't compile, because it expected NULL to match the void* version of an overloaded function.

locationlistmodel.cpp:204: error: call of overloaded createIndex(int&, int&, NULL) is ambiguous
.../qabstractitemmodel.h:288: note: candidates are: QModelIndex QAbstractItemModel::createIndex(int, int, void*) const
.../qabstractitemmodel.h:290: note:                 QModelIndex QAbstractItemModel::createIndex(int, int, int) const
.../qabstractitemmodel.h:299: note:                 QModelIndex QAbstractItemModel::createIndex(int, int, quint32) const

This was in released alpha code (http://qt-apps.org/content/show.php?content=96992) so presumably it built ok on some compiler. (Although maybe it used to just pick one of the int overloads, if that's what Lubos was talking about having to debug.) As a mostly C programmer, this just seems like something stupid in the standard, and the sort of behaviour you should only get with -std=c++0x, but not -std=gnu++0x. As everyone else is saying, who in their right mind actually wants this behaviour? And more importantly, would changing it ever make g++ actually mis-compile anything? (not counting compiling stuff like kscope, or test.c below, where the result "should" be an error message, not a binary) Anyway, NULL should be a void*, damn it. Yes, I'm a C programmer.

--
peter at cordes dot ca changed:
What            |Removed |Added
CC              |        |peter at cordes dot ca

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=35669
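For anyone who wants a self-contained reproducer of the same ambiguity (a minimal sketch with made-up names, not the kscope/Qt code): g++ rejects the f(NULL) call below with the same "call of overloaded ... is ambiguous" error, because __null converts equally well to all three parameter types, while nullptr (C++11) picks the pointer overload.

    #include <cstddef>

    void f(void *) {}
    void f(int) {}
    void f(unsigned) {}

    int main() {
        f(NULL);        // ambiguous: NULL is __null, an integer null pointer constant
        // f(nullptr);  // C++11: unambiguously calls f(void*)
    }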
[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942

Peter Cordes changed:
What            |Removed |Added
CC              |        |peter at cordes dot ca

--- Comment #53 from Peter Cordes ---
I think we can close this as fixed at some point. The last activity on this bug was some patches that sound like they were supposed to fix it, and the MCVEs from the comments that I tested no longer have the problem. GCC 9.3 -O3 -march=core2 -fomit-frame-pointer only uses a `.p2align` to align the top of the loop, not between leave and ret or between cmp and jcc.

void wait_for_enter() {
    volatile int foo = 0;   // to get a LEAVE instruction emitted at all
    int u = getchar();
    while (!u)
        u = getchar()-13;
}

https://godbolt.org/z/RvxzZv
(Note that Godbolt normally filters .p2align, so you have to either compile to binary or not filter directives in the asm source. Otherwise you'll never see NOPs except in the unusual case where GCC actually emits a nop mnemonic.)
[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 Bug ID: 92243 Summary: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a factor of probably 8, the same way we use SIMD shuffles. Consider this loop, which reverses an explicit-length char array: https://godbolt.org/z/ujXq_J

typedef char swapt;   // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len) {
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

gcc -O3 (including current trunk) targeting x86-64 makes naive scalar byte-at-a-time code, even though bswap r64 is available to byte-reverse a uint64 in 1 or 2 uops (AMD and Intel, respectively). With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking lengths and calculating how many 16-byte chunks can be done before bloated fully-unrolled cleanup). Doing the same thing with 64-bit integer registers would be very much worth it (for code where a loop like this is a bottleneck).

With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is probably worth it, but GCC chooses not to do that either. Or working in 8-byte chunks using just movq + pshuflw, so we only have 1 shuffle per 8-byte load/store instead of 3 per 16-byte store. That's a good balance for modern Intel (Haswell, Skylake, and I think IceLake), although some AMD and earlier Intel with more integer shuffle throughput (e.g. Sandybridge) might do better with 3x shuffles per 16-byte load/store.
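To make the requested transformation concrete, here is a rough C sketch (not GCC output; the function name and cleanup strategy are my own) of doing the same reversal 8 bytes at a time with __builtin_bswap64:

    #include <stdint.h>
    #include <string.h>

    // Sketch: swap and byte-reverse one 8-byte chunk from each end per iteration.
    void strrev_bswap_sketch(char *head, long len) {
        char *tail = head + len - 8;        // start of the last 8-byte chunk
        while (tail - head >= 8) {          // the two chunks don't overlap
            uint64_t h, t;
            memcpy(&h, head, 8);
            memcpy(&t, tail, 8);
            h = __builtin_bswap64(h);       // byte-reverse each chunk...
            t = __builtin_bswap64(t);
            memcpy(head, &t, 8);            // ...then cross-store them
            memcpy(tail, &h, 8);
            head += 8;
            tail -= 8;
        }
        // 0..15 bytes remain in the middle; byte-at-a-time cleanup omitted.
    }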
[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 --- Comment #1 from Peter Cordes --- Forgot to mention, this probably applies to other ISAs with GP-integer byte-reverse instructions and efficient unaligned loads.
[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Bug ID: 92244 Summary: extra sub inside vectorized loop instead of calculating end-pointer Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

We get a redundant instruction inside the vectorized loop here. But it's not a separate *counter*, it's a duplicate of the tail pointer. It goes away if we find tail with while(*tail++); instead of calculating it from head+length. Only happens with vectorization, not pure scalar (bug 92243 is about the fact that -O3 fails to use bswap as a GP-integer shuffle to auto-vectorize without x86 SSSE3).

typedef char swapt;
void strrev_explicit(swapt *head, long len) {
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

https://godbolt.org/z/wdGv4S compiled with g++ -O3 -march=sandybridge gives us a main loop of

    ...
    movq    %rcx, %rsi          # RSI = RCX before entering the loop
    addq    %rdi, %r8
.L4:
    vmovdqu (%rcx), %xmm3       # tail load from RCX
    addq    $16, %rax           # head
    subq    $16, %rcx           # tail
    subq    $16, %rsi           # 2nd tail?
    vmovdqu -16(%rax), %xmm0
    vpshufb %xmm2, %xmm3, %xmm1
    vmovups %xmm1, -16(%rax)
    vpshufb %xmm2, %xmm0, %xmm0
    vmovups %xmm0, 16(%rsi)     # tail store to RSI
    cmpq    %r8, %rax           # } while(head != end_head)
    jne     .L4

RSI = RCX before and after the loop. This is obviously pointless. head uses the same register for loads and stores.

Then we have bloated fully-unrolled scalar cleanup, instead of using the shuffle control for 8-byte vectors -> movhps. Or scalar bswap. Ideally we'd do something clever at the overlap like one load + shuffle + store, but we might have to load the next vector before storing the current to make this work at the overlap. That would presumably require more special-casing this kind of meet-in-the-middle loop.

The implicit-length version doesn't have this extra sub in the main loop.

void strrev_implicit(swapt *head) {
    swapt *tail = head;
    while(*tail) ++tail;   // find the 0 terminator, like head+strlen
    --tail;                // tail points to the last real char
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

.L22:
    vmovdqu (%rcx), %xmm3
    addq    $16, %rdx           # head
    subq    $16, %rcx           # tail
    vmovdqu -16(%rdx), %xmm0
    vpshufb %xmm2, %xmm3, %xmm1
    vmovups %xmm1, -16(%rdx)
    vpshufb %xmm2, %xmm0, %xmm0
    vmovups %xmm0, 16(%rcx)
    cmpq    %rsi, %rdx          # } while(head != end_head)
    jne     .L22
[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #1 from Peter Cordes ---
On AArch64 (with gcc8.2), we see a similar effect, more instructions in the loop. And an indexed addressing mode. https://godbolt.org/z/6ZVWY_

# strrev_explicit -O3 -mcpu=cortex-a53
    ...
.L4:
    ldr   q1, [x4, x2]                  # tail
    ldr   q0, [x3]                      # head
    tbl   v1.16b, {v1.16b}, v2.16b      # byte shuffle
    tbl   v0.16b, {v0.16b}, v2.16b
    str   q1, [x3], 16                  # post-increment store to head
    cmp   x3, x1
    str   q0, [x4, x2]
    sub   x2, x2, #16                   # doesn't update flags, not SUBS
    bne   .L4                           # }while( head != end_head )

# strrev_implicit -O3 -mcpu=cortex-a53
    ...
.L19:
    ldr   q1, [x3]
    ldr   q0, [x2]
    tbl   v1.16b, {v1.16b}, v2.16b
    tbl   v0.16b, {v0.16b}, v2.16b
    str   q1, [x2], 16                  # post-increment addressing mode
    cmp   x2, x4
    str   q0, [x3], -16                 # post-decrement addressing mode
    bne   .L19                          # }while( head != end_head )
[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Peter Cordes changed: What|Removed |Added Summary|extra sub inside vectorized |vectorized loop updating 2 |loop instead of calculating |copies of the same pointer |end-pointer |(for in-place reversal ||cross in the middle) --- Comment #2 from Peter Cordes --- Forgot to update title after looking more carefully at the asm.
[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 Bug ID: 92246 Summary: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512) Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

typedef short swapt;
void strrev_explicit(swapt *head, long len) {
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

g++ -O3 -march=skylake-avx512 (Compiler-Explorer-Build) 10.0.0 20191022 (experimental) https://godbolt.org/z/LS34w9

    ...
.L4:
    vmovdqu16 (%rdx), %ymm1
    vmovdqu16 (%rax), %ymm0
    vmovdqa64 %ymm1, %ymm3          # useless copy
    vpermt2w  %ymm1, %ymm2, %ymm3
    vmovdqu16 %ymm3, (%rax)
    vpermt2w  %ymm0, %ymm2, %ymm0
    addq      $32, %rax
    vmovdqu16 %ymm0, (%rcx)
    subq      $32, %rdx
    subq      $32, %rcx             # two tail pointers, PR 92244 is unrelated to this
    cmpq      %rsi, %rax
    jne       .L4

vpermt2w ymm is 3 uops on SKX and CannonLake: 2p5 + p015 (https://www.uops.info/table.html)

Obviously better would be vpermw (%rax), %ymm2, %ymm0. vpermw apparently can't micro-fuse a load, but it's only 2 ALU uops plus a load if we use a memory source. SKX still bottlenecks on 2p5 for vpermw, losing only the p015 uop, but in general fewer uops is better. But on CannonLake it runs on p01 + p5 (plus p23 with a memory source). uops.info doesn't have IceLake-client data yet, but vpermw throughput on IceLake is 1/clock vs. 1 per 2 clocks for vpermt2w, so this could double throughput on CNL and ICL.

We have exactly the same problem with AVX512VBMI vpermt2b over vpermb with ICL g++ -O3 -march=icelake-client -mprefer-vector-width=512
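For reference, an intrinsics sketch (my own function name) of the single-source shuffle this report is asking for: reversing 16 shorts needs only one data source plus a constant index vector, so this should compile to vpermw with -O2 -mavx512bw -mavx512vl (or -march=skylake-avx512):

    #include <immintrin.h>

    // Reverse the 16 shorts in a __m256i with a single-source vpermw (AVX512BW + VL).
    __m256i reverse16_shorts(__m256i v) {
        const __m256i rev_idx = _mm256_set_epi16(0, 1, 2, 3, 4, 5, 6, 7,
                                                 8, 9, 10, 11, 12, 13, 14, 15);
        return _mm256_permutexvar_epi16(rev_idx, v);
    }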
[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 --- Comment #1 from Peter Cordes ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm.

Apparently CannonLake has 1 uop vpermb but 2 uop vpermw, according to real testing on real hardware by https://uops.info/. Their automated test methods are generally reliable. That seems to be true for Ice Lake, too, so when AVX512VBMI is available we should be using vpermb any time we might have used vpermw with a compile-time-constant control vector. (vpermw requires AVX512BW, e.g. SKX and Cascade Lake. vpermb requires AVX512VBMI, only Ice Lake and the mostly aborted CannonLake.)

InstLatX64 provides some confirmation: https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel00706E5_IceLakeY_InstLatX64.txt shows vpermb at 3 cycle latency, but vpermw at 4 cycle latency (presumably a chain of 2 uops, 1c and 3c being the standard latencies that exist in recent Intel CPUs). InstLat doesn't document which input the dep chain goes through, so it's not 100% confirmation of only 1 uop. But it's likely that ICL has 1-uop vpermb given that CNL definitely does.

uops.info lists latencies separately from each input to the result, sometimes letting us figure out that e.g. one of the inputs isn't needed until the 2nd uop. That seems to be the case for CannonLake vpermw: latency from one of the inputs is only 3 cycles, the other is 4. https://www.uops.info/html-lat/CNL/VPERMW_YMM_YMM_YMM-Measurements.html
[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #4 from Peter Cordes --- (In reply to Andrew Pinski from comment #3) > (In reply to Peter Cordes from comment #1) > > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the > > loop. And an indexed addressing mode. That was an overstatement, the generic tuning I showed isn't using 2 separate pointers or indices like we get on x86. Your thunderx2t99 output is like that, but write-back addressing modes mean it doesn't cost extra instructions. > I am not shocked that IV-OPTS can chose these widly differences. > I have not looked at the cost differences to understand why > -mcpu=thunderx2t99 chose what close might be the best (we could use one less > IV by replacing the first ldr by using the same IV as the last str). I don't know ARM tuning; the x86 version is clearly worse with an extra uop inside the loop. And an extra instruction to copy the register before the loop, wasting code-size if nothing else. On Skylake for example, the loop is 10 uops and bottlenecks on front-end throughput (4 uops / clock) if the back-end can keep up with a bit less than 1 store per clock. (Easy if pointers are aligned and data is hot in L1d). Reducing it to 9 uops should help in practice. Getting it down to 8 uops would be really nice, but we can't do that unless we could use a shuffle that micro-fuses with a load. (For int elements, AVX2 VPERMD can micro-fuse a memory source, so can SSE2 PSHUFD. pshufb's xmm/memory operand is the control vector which doesn't help us. AVX512 vpermb can't micro-fuse)
[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459

Peter Cordes changed:
What            |Removed                        |Added
See Also        |                               |https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
Summary         |AVX512F instruction costs:     |AVX512BW instruction costs:
                |vmovdqu8 stores may be an      |vpmovwb is 2 uops on
                |extra uop, and vpmovwb is 2    |Skylake and not always
                |uops on Skylake and not        |worth using vs. vpack +
                |always worth using             |vpermq lane-crossing fixup

--- Comment #5 from Peter Cordes ---
Turns out vmovdqu8 with no masking doesn't cost an extra uop. IACA was wrong, and Agner Fog's results were *only* for the masked case. The only downside of that is the code-size cost of using EVEX load/store instructions instead of AVX2 VEX. That's bug 89346.

https://www.uops.info/table.html confirms that SKX non-masked vmovdqu8 load and store are both single uop. (Or the usual micro-fused store-address + store-data.)
https://www.uops.info/html-tp/SKX/VMOVDQU8_ZMM_M512-Measurements.html
https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_ZMM-Measurements.html
And between registers it can be eliminated if there's no masking.

But *with* masking, as a load it's a micro-fused load+ALU uop, and as a masked store it's just a normal store uop for xmm and ymm. But a zmm masked store is 5 uops (micro-fused to 4 front-end uops)! (Unlike vmovdqu16 or 32 masked stores, which are efficient even for zmm.)
https://www.uops.info/html-tp/SKX/VMOVDQU8_M512_K_ZMM-Measurements.html

uops.info's table also shows us that IACA3.0 is wrong about vmovdqu8 as an *unmasked* ZMM store: IACA thinks that's also 5 uops. Retitling this bug report since that part was based on Intel's bogus data, not real testing.

vpmovwb is still 2 uops, and current trunk gcc still uses 2x vpmovwb + vinserti64x4 for ZMM auto-vec. -mprefer-vector-width=512 is not the default, but people may enable it in code that heavily uses 512-bit vectors.

YMM auto-vec is unchanged since previous comments: we do get vpackuswb + vpermq, but an indexed addressing mode defeats micro-fusion. And we have redundant VPAND after shifting.

---

For icelake-client/server (AVX512VBMI) GCC is using vpermt2b, but it doesn't fold the shifts into the 2-source byte shuffle. (vpermt2b has 5c latency and 2c throughput on ICL, so probably its uop count is the same as uops.info measured for CannonLake: 1*p05 + 2*p5. Possibly 2x 1-uop vpermb, with merge-masking for the 2nd into the first, would work better.)

IceLake vpmovwb ymm,zmm is still 2-cycle throughput, 4-cycle latency, so probably still 2 uops.
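For anyone who wants to reproduce the masked-store measurement, a minimal sketch (my own wrapper name) of the store in question:

    #include <immintrin.h>

    // Compiles to a masked vmovdqu8 store (AVX512BW); the zmm form of this is
    // what uops.info measured at 5 uops (4 fused-domain) on SKX.
    void masked_store_bytes(char *dst, __mmask64 k, __m512i v) {
        _mm512_mask_storeu_epi8(dst, k, v);
    }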
[Bug target/89346] Unnecessary EVEX encoding
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346

Peter Cordes changed:
What            |Removed |Added
CC              |        |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
Still present in pre10.0.0 trunk 20191022. We pessimize vmovdqu/a in AVX2 intrinsics and autovectorization with -march=skylake-avx512 (and -march=native on such machines). It seems only VMOVDQU/A load/store/register-copy instructions are affected; we get AVX2 VEX vpxor instead of AVX512VL EVEX vpxord for xor-zeroing, and for non-zeroing XOR. (And most other instructions have the same mnemonic for VEX and EVEX, like vpaddd. This includes FP moves like VMOVUPS/PD.) (https://godbolt.org/z/TEvWiU for example)

The good options are:

* Use VEX whenever possible instead of AVX512VL to save code-size (2 or 3 byte prefix instead of 4-byte EVEX).

* Avoid the need for vzeroupper by using only x/y/zmm16..31. (Still has a max-turbo penalty, so -mprefer-vector-width=256 is still appropriate for code that doesn't spend a lot of time in vectorized loops.) This might be appropriate for very simple functions / blocks that only have a few SIMD instructions before the next vzeroupper would be needed (e.g. copying or zeroing some memory); it could be competitive on code-size as well as saving the 4-uop instruction. VEX instructions can't access x/y/zmm16..31, so this forces an EVEX encoding for everything involving the vector (and rules out using AVX2 and earlier instructions, which may be a problem for KNL without AVX512VL unless we narrow to 128-bit in an XMM reg).

(Citation for not needing vzeroupper if y/zmm0..15 aren't written explicitly: https://stackoverflow.com/questions/58568514/does-skylake-need-vzeroupper-for-turbo-clocks-to-recover-after-a-512-bit-instruc - it's even safe to do

    vpxor    xmm0, xmm0, xmm0
    vpcmpeqb k0, zmm0, [rdi]

without vzeroupper, although that will reduce max turbo *temporarily* because it's a 512-bit uop. Or, more frequently useful: to zero some memory with vpxor xmm zeroing and YMM stores.)
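A minimal example of the kind of code where this shows up (my own reproducer, an assumption about what the godbolt link contains, not copied from it): with -O2 -march=skylake-avx512, the 256-bit copy below can get the 4-byte-prefix EVEX vmovdqu64 encoding where the shorter VEX vmovdqu would do.

    #include <immintrin.h>

    void copy32(void *dst, const void *src) {
        __m256i v = _mm256_loadu_si256((const __m256i *)src);
        _mm256_storeu_si256((__m256i *)dst, v);
    }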
[Bug target/40838] gcc shouldn't assume that the stack is aligned
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #91 from Peter Cordes --- This bug should be closed as "resolved fixed". The "fix" was to change the ABI doc and break existing hand-written asm, and old binaries. This was intentional and resulted in some pain, but at this point it's a done deal. My attempt at a summary of the current state of affairs for 32-bit x86 calling conventions (on Linux and elsewhere): Yes, the version of the i386 System V ABI used on Linux really did change between gcc2.8 and gcc8. Those compilers are not ABI-compatible with each other. This is a known fact. Hand-written asm that makes function calls with misaligned stack pointers is violating the (updated) ABI, and was also knowingly broken by this change. (Perhaps unintentionally at first, with stack alignment intended to just provide a performance benefit, not a correctness issue. But the resolution ended up being to standardize on 16-byte alignment matching x86-64 System V. Instead of reverting to the old ABI and breaking compat with new binaries that had started to rely on 16-byte incoming alignment, or to add significant overhead to every function that didn't know how both its caller and callee were compiled, i.e. most functions. Using MOVUPS instead of MOVAPS everywhere wouldn't work well because it would mean no folding of memory operands into ALU instructions: without AVX's VEX encoding, paddd xmm0, [mem] requires aligned mem. And existing binaries that rely on incoming 16-byte alignment weren't doing that.) An earlier comment also mentioned common arrays: the ABI also requires arrays larger than 16 bytes to have 16-byte alignment. Perhaps unnecessary pain for little real benefit: i386 on Linux has been mostly obsolete for a long time, and the inefficient stack-args calling convention was never changed. It's ironic that Linux broke ABI compat for i386 in the name of more efficient SSE-usage despite not caring to introduce anything like Windows fastcall or vectorcall (efficient register-args calling conventions). (GCC does have ABI-changing -mregparm=3 and -msseregparm to pass integers in regs, and pass/return FP values in XMM registers (instead of passing on the stack / returning in x87 st0). But no distros have switched over to using that calling convention for i386 binaries, AFAIK. The Linux kernel does use regparm for 32-bit kernel builds.) Even more ironic, probably a lot of 32-bit code is compiled without -msse2 (because one of the main reasons for using 32-bit code is CPUs too old for x86-64, which is about the same vintage as SSE2). SSE usage can still happen with runtime dispatching in binaries that are compatible with old machines while still being able to take advantage of new ones. But in most cases, if you want performance you use x86-64 kernel + user-space, or maybe x32 user-space (ILP32 in 64-bit mode) to get modern calling conventions and the benefit of twice as many registers. x86-64 System V has mandated 16-byte stack alignment from the start. (I don't know the history, but perhaps i386 code-gen started assuming / depending on it for correctness, not just performance, by accident because of devs being used to x86-64?) The 32-bit ABI on some other OSes, including i386 *BSD and 32-bit Windows, has *not* changed; presumably gcc there doesn't rely on incoming stack alignment. (It might try to propagate 16-byte alignment for performance benefits, though.) 
My understanding is that i386 MacOS still uses a version of i386 System V that doesn't include the 16-byte stack alignment update, like other *BSDs. (In reply to Harald van Dijk from comment #90) > compile > > void exit(int); > int main(void) { exit(0); } > > with GCC 2.8, compile current glibc with GCC 8, and there will be a segfault > in glibc's __run_exit_handlers because GCC 2.8 never kept the stack > 16-byte-aligned, but GCC 8 does now generate code which assumes it. > > For the moment, I've rebuilt glibc with -mincoming-stack-boundary=2 to handle > the problem well enough for my current needs, but it's not a complete > solution. Yes, you need workarounds like this to change modern GCC's ABI back to legacy 4-byte. Note that you might break atomicity of C11 _Atomic 8-byte objects even outside structs by doing this, if they split across a cache line (Intel) or possibly narrower (AMD) boundary. But only if they were stack allocated.
[Bug target/93141] Missed optimization : Use of adc when checking overflow
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2 from Peter Cordes --- gcc doesn't actually *branch* unless you use an if(), it just uses cmp/sbb to do a 128-bit compare. CMP is like a SUB that only sets flags. The CF result of SBB is used as an input for ADC. https://godbolt.org/z/64C4R- of a testcase GCC also wastes a varying number of MOV instructions beyond the minimum one to make cmp/sbb work, depending on BMI2 MULX or not, and how the sum is written. u128 prod = a[i] * (unsigned __int128) b[i]; #if 1 sum += prod; //if(sum
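The testcase above is cut off, so here is a hedged reconstruction of its shape (variable names and the exact form of the overflow check are my guesses, not the original): accumulate 128-bit products and count the carry-out of each addition.

    typedef unsigned __int128 u128;

    // Count how many times the 128-bit accumulator wraps.
    unsigned long sum_overflows(const unsigned long *a, const unsigned long *b, long n)
    {
        u128 sum = 0;
        unsigned long overflow = 0;
        for (long i = 0; i < n; i++) {
            u128 prod = a[i] * (u128)b[i];
            sum += prod;
            overflow += (sum < prod);   // carry-out of the 128-bit add; gcc does cmp/sbb + adc here
        }
        return overflow;
    }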
[Bug target/89063] [x86] lack of support for BEXTR from BMI extension
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063

Peter Cordes changed:
What            |Removed |Added
CC              |        |peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
Unfortunately Intel Haswell/Skylake implement BEXTR as 2 uops with 2c latency. Presumably those uops are a shift + bzhi, so 1p06 + 1p15 would explain Agner Fog's experimental result of 2p0156 for BEXTR, with 0.5c throughput.

On AMD Excavator/Ryzen, it's 1 uop with 1c latency. On Steamroller and earlier, it's 2 uops but 1c latency. (I assume that's latency from the non-control input to the output. So maybe one of the uops pre-processes the control input, otherwise you'd expect 2c latency from either operand.) Ryzen dropped support for AMD TBM, so only Excavator (bdver4) has 1-uop bextr imm16, which would avoid the need for mov reg,imm32 with the control operand. But mov-imm + bextr can still be a win on Ryzen, lower latency than RORX+AND.

BMI2 RORX is single-uop on all CPUs that support it. If we already need a 2nd uop to mask anyway, we can use RORX+AND-immediate to duplicate the functionality and performance of BEXTR-immediate, with smaller code-size if the AND-mask fits in an imm8. (5+5 vs. 6+3, or 6+4 if the AND needs a REX.)

Without an immediate-source BEXTR (like AMD TBM has/had), the only advantage mov-immediate+bextr has (on Intel) over mov-reg+shift+and is that it can deal with wide bitfields using a count instead of an immediate AND mask. (Especially if it doesn't fit in 32 bits.) If you can reuse the same control-register in a loop, BEXTR is good-ish for copy-and-extract.

PEXT is 1 uop on Intel CPUs even though the simpler-looking BEXTR is 2. But PEXT is extremely slow on Ryzen (7 uops, 18c lat and tput). So for 32-bit constants at least, mov r32,imm32 + PEXT to copy-and-extract is better than BEXTR on Intel. movabs imm64 is too big and can cause front-end problems (slower to read from the uop cache, if that effect from Sandybridge is still present on Haswell/Skylake), and has no advantage vs. RORX + AND unless the bitfield you're extracting is wider than 32 bits. PEXT has 3 cycle latency, though, and can only run on port 1 on SnB-family. (All integer uops with latency > 1 are p1-only.) It's potentially good for throughput, but worse than RORX+AND for latency.

Unfortunately x86 bitfield instructions are pretty weak compared to ARM / AArch64 ubfx or PowerPC rlwinm and friends, where the bit-positions are simply specified as immediates. Only AMD's immediate version of BEXTR (1 uop on Excavator) matched them. Having a bunch of different control operands for BEXTR or PEXT in registers might be usable in a loop, but that's a lot more rarely useful than immediate controls.

   0:  c4 e3 fb f0 c7 2a    rorx   $0x2a,%rdi,%rax    # $(64-22)
   6:  c4 e3 fb f0 d7 35    rorx   $0x35,%rdi,%rdx    # $(64-11)
   c:  83 e7 3f             and    $0x3f,%edi
   f:  83 e0 3f             and    $0x3f,%eax
  12:  83 e2 3f             and    $0x3f,%edx
  15:  01 f8                add    %edi,%eax    # 32-bit operand-size because we can prove it can't overflow
  17:  01 d0                add    %edx,%eax    # missed optimization in both gcc's versions.
  19:  c3                   retq

Not counting the ret, this is 7 uops for Skylake and Ryzen. **I'm pretty sure this is our best bet for -march=skylake, and for tune=generic -mbmi2.** The BEXTR intrinsics version is 9 uops for SKL, 7 for Ryzen, but is 2 bytes larger. (Not counting the savings from avoiding a REX prefix on the ADD instructions; that missed optimization applies equally to both.) OTOH, the critical path latency for BEXTR on Ryzen is better by 1 cycle, so we could still consider it for -march=znver1. Or for tune=generic -mbmi without BMI2.
The legacy mov+shr+and version is 10 uops because gcc wasted a `mov %rdi,%rax` instruction; it *should* be 9 uops for all normal CPUs. --- With only BMI1 but not BMI2 enabled, we should probably use the mov-imm + BEXTR version. It's not worse than the mov+shr+and version on SnB-family or bd/zn, and it's better on some AMD. And it's probably smaller code-size. And in future if Intel designs CPUs that can handle BEXTR as a single uop with 1c latency, mov+bextr will become good-ish everywhere. For code-size, BEXTR has a definite advantage for bitfields wider than 1 byte, because AND $imm32, %r32 is 6 bytes long instead of 3.
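To make the disassembly in the previous comment easier to follow, here is a guess at the kind of source being compiled (purely illustrative; the field positions are read off the rorx/and immediates, not taken from the bug's testcase):

    // Extract three 6-bit fields from a 64-bit value and sum them. With -mbmi2
    // the non-destructive shifts can be done with rorx + and as in the listing
    // above; with only -mbmi it would be mov-imm + bextr per field.
    unsigned sum_bitfields(unsigned long long x)
    {
        unsigned a = x & 63;            // bits [5:0]
        unsigned b = (x >> 42) & 63;    // bits [47:42]: rorx $(64-22) then mask
        unsigned c = (x >> 53) & 63;    // bits [58:53]: rorx $(64-11) then mask
        return a + b + c;               // sum fits in 32 bits, so no REX needed
    }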
[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 Bug ID: 89071 Summary: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double Product: gcc Version: 9.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- float cvt(double unused, double xmm1) { return xmm1; } g++ (GCC-Explorer-Build) 9.0.0 20190120 (experimental): vxorps %xmm0, %xmm0, %xmm0 vcvtsd2ss %xmm1, %xmm0, %xmm0# merge into XMM0 clang7.0 vcvtsd2ss %xmm1, %xmm1, %xmm0# both sources are from XMM1, no false dep gcc already uses this trick for SQRTSS/SD, but not for float<->double conversion. I haven't checked all the other scalar instructions, but roundss for floor() does neither and has a false dependency. (i.e. it chooses the output register as the merge-target, not the actual input.) return floorf(x); -> vroundss$9, %xmm1, %xmm0, %xmm0 Some testcases: https://godbolt.org/z/-rqUVZ --- In SSE, one-input scalar instructions like CVT* and SQRTSS/SD have an output dependency because of Intel's short-sighted ISA design optimizing for Pentium-III's 64-bit SIMD: zero-extending to fill the destination XMM register would have cost an extra uop to write the upper half of the destination. For consistency(?), SSE2 scalar instructions (new with Pentium 4 which had 128-bit SIMD execution units / register file) have the same behaviour of merging into the low 64 bits of the destination, even conversion between double and float between two xmm registers, which didn't exist before SSE2. (Previously conversion instructions were only between float in XMM and integers in scalar or MMX regs, or packed-integer <-> ps which filled the whole XMM reg and thus avoided a false dependency). (Fortunately this isn't a problem for 2-input instructions like ADDSS: the operation already depends on both registers.) --- The VEX encoding makes the merge-target separate from the actual destination, so we can finally avoid false dependencies without wasting an instruction breaking it. (When the source is already in an XMM register). For instructions where the source isn't an XMM register (e.g. memory or integer reg for int->FP conversions), one zeroed register can be used as a read-only merge target by any number of scalar AVX instructions, including in a loop. That's bug 80571. (It's unfortunate that Intel didn't take the opportunity to give the AVX versions subtly different semantics, and zero-extend into the target register. That would probably have enabled vcvtsd2ss to be single-uop instead of 2 on Sandybridge-family. IDK if they didn't think of that, or if they wanted strict consistency with the semantics of the SSE version, or if they thought decoding / internals would be easier if they didn't have to omit the merge-into-destination part of the scalar operation. At least they made the extra dependency an explicit input, so we can choose a register other than the destination, but it's so rarely useful to actually merge into the low 64 or 32 of another reg that it's just long-term harmful to gimp the ISA with an extra dependency for these instructions, especially integer->FP.) (I suspect that most of the dep-breaking gcc does isn't gaining any speed, but the trick is figuring out when we can omit it while being sure that we don't couple things into one big loop-carried chain, or serialize some things that OoO exec could otherwise benefit from hiding. 
Within one function with no calls, we might be able to prove that a false dep isn't serializing anything important (e.g. if there's already enough ILP and something else breaks a dep on that register between loop iterations), but in general it's hard if we can't pick a register that was already part of the dep chain that led to the input for this operation, and thus is harmless to introduce a dep on.) Relevant instructions that can exist in scalar xmm,xmm form: VROUNDSS/SD (gcc leaves a false dep, clang gets it right) VSQRTSS/SD (gcc already gets this right) VRCPSS VRSQRTSS haven't checked [V]CVTSS2SD xmm,xmm (Skylake: SRC1/output dependency is a separate 1c latency 32-bit merge uop) The memory-source version is still 2 uops. [V]CVTSD2SS xmm,xmm (Skylake: SRC1/output dependency is the main 4c conversion uop, the extra uop is first, maybe extracting 32 bits from the src?) The memory-source version of [V]CVTSD2SS is only 1 uop! So avoiding a false dep by loading with MOVSS/MOVSD and then using the reg-reg version is a bad idea for CVTSD2SS. It's actually much better to PXOR and then CVTSD2SS (me
[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |FIXED --- Comment #1 from Peter Cordes --- Fixed for vsqrtss/sd somewhere in 9.0, but not 8.2. https://godbolt.org/z/0Gxf05. The general case of one-input scalar xmm,xmm instructions like vcvtss2sd is still all over the place, with false deps or wasted xor-zeroing. Reported that as bug 89071 It seems only VSQRTsd/ss itself was fixed for this; sorry I didn't think of checking for other one-input instructions when I reported this.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #2 from Peter Cordes --- (In reply to H.J. Lu from comment #1) > But > > vxorps %xmm0, %xmm0, %xmm0 > vcvtsd2ss %xmm1, %xmm0, %xmm0 > > are faster than both. On Skylake-client (i7-6700k), I can't reproduce this result in a hand-written asm loop. (I was using NASM to make a static executable that runs a 100M iteration loop so I could measure with perf). Can you show some asm where this performs better? vcvtsd2ss src-reg,dst,dst is always 2 uops, regardless of the merge destination being an xor-zeroed register. (Either zeroed outside the loop, or inside, or once per 4 converts with an unrolled loop.) I can't construct a case where vcvtsd2ss %xmm1, %xmm1, %xmm0 is worse in any way (dependencies, uops, latency, throughput) than VXORPS + vcvtsd2ss with dst = middle source. I wasn't mixing it with other instructions other than VXORPS, but I don't think anything is going to get rid of its 2nd uop, and choosing both inputs = the same source removes any benefit from dep-breaking the output. If adding a VXORPS helped, its probably due to some other side-effect. Could the effect you saw have been due to code-gen changes for memory sources, maybe vxorps + vcvtsd2ss (mem), %xmm0, %xmm0 vs. vmovsd + vcvtsd2ss %xmm1, %xmm1, %xmm0? (Those should be about equal, but memory-source SS2SD is cheaper, no port5 uop.) BTW, the false-dependency effect is much more obvious with SS2SD, where the latency from src1 to output is 4 cycles, vs. 1 cycle for SD2SS. Even without dependency-breaking, repeated vcvtsd2ss %xmm1, %xmm0, %xmm0 can run at 1 per clock (same as with dep breaking), because the port-5 uop that merges into the low 32 bits of xmm0 with 1 cycle latency is 2nd. So latency from xmm0 -> xmm0 for that [v]cvtsd2ss %xmm1, %xmm0 is 1 cycle. With dep-breaking, they both still bottleneck on the port5 uop if you're doing nothing else.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #3 from Peter Cordes ---
(In reply to H.J. Lu from comment #1)
> I have a patch for PR 87007:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>
> which inserts a vxorps at the last possible position. vxorps
> will be executed only once in a function.

That's talking about the mem,reg case, which like I said is different. I reported Bug 80571 a while ago about the mem,reg case (or gp-reg for si2ss/d), so it's great that you have a fix for that, doing one xor-zeroing and reusing that as a merge target for a whole function / loop.

But this bug is about the reg,reg case, where I'm pretty sure there's nothing to be gained from xor-zeroing anything. We can fully avoid any false dep just by choosing both source registers = src, making the destination properly write-only.

If you *have* an xor-zeroed register, there's no apparent harm in using it as the merge-target for a reg-reg vcvt, vsqrt, vround, or whatever, but there's no benefit either vs. just setting both source registers the same. So whichever is easier to implement, but ideally we want to avoid introducing a vxorps into functions / blocks that don't need it at all.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #5 from Peter Cordes ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)
> > Can you show some
> > asm where this performs better?
>
> Please try cvtsd2ss branch at:
>
> https://github.com/hjl-tools/microbenchmark/
>
> On Intel Core i7-6700K, I got

I have the same CPU.

> [hjl@gnu-skl-2 microbenchmark]$ make
> gcc -g -I. -c -o test.o test.c
> gcc -g -c -o sse.o sse.S
> gcc -g -c -o sse-clear.o sse-clear.S
> gcc -g -c -o avx.o avx.S
> gcc -g -c -o avx2.o avx2.S
> gcc -g -c -o avx-clear.o avx-clear.S
> gcc -o test test.o sse.o sse-clear.o avx.o avx2.o avx-clear.o
> ./test
> sse      : 24533145
> sse_clear: 24286462
> avx      : 64117779
> avx2     : 62186716
> avx_clear: 58684727
> [hjl@gnu-skl-2 microbenchmark]$

You forgot the RET at the end of the AVX functions (but not the SSE ones); the AVX functions fall through into each other, then into __libc_csu_init before jumping around and eventually returning. That's why they're much slower. Single-step through the loop in GDB...

 │0x5660                      vcvtsd2ss xmm0,xmm0,xmm1
>│0x5664                      nop    WORD PTR cs:[rax+rax*1+0x0]
 │0x566e                      xchg   ax,ax
 │0x5670                      vcvtsd2ss xmm0,xmm1,xmm1
 │0x5674                      nop    WORD PTR cs:[rax+rax*1+0x0]
 │0x567e                      xchg   ax,ax
 │0x5680                      vxorps xmm0,xmm0,xmm0
 │0x5684                      vcvtsd2ss xmm0,xmm0,xmm1
 │0x5688                      nop    DWORD PTR [rax+rax*1+0x0]
 │0x5690 <__libc_csu_init>    endbr64
 │0x5694 <__libc_csu_init+4>  push   r15
 │0x5696 <__libc_csu_init+6>  mov    r15,rdx

And BTW, SSE vs. SSE_clear are about the same speed because your loop bottlenecks on the store/reload latency of keeping a loop counter in memory (because you compiled the C without optimization). Plus, the C caller loads write-only into XMM0 and XMM1 every iteration, breaking any loop-carried dependency the false dep would create.

I'm not sure why it makes a measurable difference to run the extra NOPs, and 3x vcvtsd2ss instead of 1, for avx() vs. avx_clear(), because the C caller should still be breaking dependencies for the AVX-128 instructions. But whatever the effect is, it's totally unrelated to what you were *trying* to test. :/
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #6 from Peter Cordes --- (In reply to Peter Cordes from comment #5) > But whatever the effect is, it's totally unrelated to what you were *trying* > to test. :/ After adding a `ret` to each AVX function, all 5 are basically the same speed (compiling the C with `-O2` or -O2 -march=native), with just noise making it hard to see anything clearly. sse_clear tends to be faster than sse in a group of runs, but if there are differences it's more likely due to weird front-end effects and all the loads of inputs + store/reload of the return address by call/ret. I did while ./test; : ;done to factor out CPU clock-speed ramp up and maybe some cache warmup stuff, but it's still noisy from run to run. Making printf/write system calls between tests will cause TLB / branch-prediction effects because of kernel spectre mitigation, so I guess every test is in the same boat, running right after a system call. Adding loads and stores into the mix makes microbenchmarking a lot harder. Also notice that since `xmm0` and `xmm1` pointers are global, those pointers are reloaded every time through the loop even with optimization. I guess you're not trying to minimize the amount of work outside of the asm functions, to measure them as part of a messy loop. So for the version that have a false dependency, you're making that dependency on the result of this: movrax,QWORD PTR [rip+0x2ebd] # reload xmm1 vmovapd xmm1,XMMWORD PTR [rax+rbx*1] # index xmm1 Anyway, I think there's too much noise in the data, and lots of reason to expect that vcvtsd2ss %xmm0, %xmm0, %xmm1 is strictly better than VPXOR+convert, except in cases where adding an extra uop actually helps, or where code-alignment effects matter.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #8 from Peter Cordes --- Created attachment 45544 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit testloop-cvtss2sd.asm (In reply to H.J. Lu from comment #7) > I fixed assembly codes and run it on different AVX machines. > I got similar results: > > ./test > sse : 28346518 > sse_clear: 28046302 > avx : 28214775 > avx2 : 28251195 > avx_clear: 28092687 > > avx_clear: > vxorps %xmm0, %xmm0, %xmm0 > vcvtsd2ss %xmm1, %xmm0, %xmm0 > ret > > is slightly faster. I'm pretty sure that's a coincidence, or an unrelated microarchitectural effect where adding any extra uop makes a difference. Or just chance of code alignment for the uop-cache (32-byte or maybe 64-byte boundaries). You're still testing with the caller compiled without optimization. The loop is a mess of sign-extension and reloads, of course, but most importantly keeping the loop counter in memory creates a dependency chain involving store-forwarding latency. Attempting a load later can make it succeed more quickly in store-forwarding cases, on Intel Sandybridge-family, so perhaps an extra xor-zeroing uop is reducing the average latency of the store/reloads for the loop counter (which is probably the real bottleneck.) https://stackoverflow.com/questions/49189685/adding-a-redundant-assignment-speeds-up-code-when-compiled-without-optimization Loads are weird in general: the scheduler anticipates their latency and dispatches uops that will consume their results in the cycle when it expects a load will put the result on the forwarding network. But if the load *isn't* ready when expected, it may have to replay the uops that wanted that input. See https://stackoverflow.com/questions/54084992/weird-performance-effects-from-nearby-dependent-stores-in-a-pointer-chasing-loop for a detailed analysis of this effect on IvyBridge. (Skylake doesn't have the same restrictions on stores next to loads, but other effects can cause replays.) https://stackoverflow.com/questions/52351397/is-there-a-penalty-when-baseoffset-is-in-a-different-page-than-the-base/52358810#52358810 is an interesting case for pointer-chasing where the load port speculates that it can use the base pointer for TLB lookups, instead of the base+offset. https://stackoverflow.com/questions/52527325/why-does-the-number-of-uops-per-iteration-increase-with-the-stride-of-streaming shows load replays on cache misses. So there's a huge amount of complicating factors from using a calling loop that keeps its loop counter in memory, because SnB-family doesn't have a simple fixed latency for store forwarding. If I put the tests in a different order, I sometimes get results like: ./test sse : 26882815 sse_clear: 26207589 avx_clear: 25968108 avx : 25920897 avx2 : 25956683 Often avx (with the false dep on the load result into XMM1) is slower than avx_clear of avx2, but there's a ton of noise. Adding vxorps %xmm2, %xmm2, %xmm2 to avx.S also seems to have sped it up; now it's the same speed as the others, even though I'm *not* breaking the dependency chain anymore. XMM2 is unrelated, nothing touches it. This basically proves that your benchmark is sensitive to extra instructions, whether they interact with vcvtsd2ss or not. We know that in the general case, throwing in extra NOPs or xor-zeroing instructions on unused registers does not make code faster, so we should definitely distrust the result of this microbenchmark. I've attached my NASM loop. 
It has various commented-out loop bodies, and notes in comments on results I found with performance counters. I don't know if it will be useful (because it's a bit messy), but it's what I use for testing snippets of asm in a static binary with near-zero startup overhead. I just run perf stat on the whole executable and look at cycles / uops.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #10 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently of PR87007 patch.
>
> [1] https://gcc.gnu.org/ml/gcc-patches/2018-05/msg00202.html

Yeah, that looks good. So I think it's just vcvtss2sd and sd2ss, and VROUNDSS/SD, that aren't done yet. That patch covers VSQRTSS/SD, VRCPSS, and VRSQRTSS.

It also bizarrely uses it for VMOVSS, which gcc should only emit if it actually wants to merge (right?). *If* this part of the patch isn't a bug

-  return "vmovss\t{%1, %0, %0|%0, %0, %1}";
+  return "vmovss\t{%d1, %0|%0, %d1}";

then even better would be vmovaps %1, %0 (which can benefit from mov-elimination, and doesn't need a port-5-only ALU uop). Same for vmovsd of course.
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #4 from Peter Cordes ---
I suspect dep-chains are the problem, and branching to skip work is a Good Thing when it's predictable.

(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on Ryzen even better.
> On Bulldozer it also isn't that bad (comparable to Skylake I guess).

SKL: AVX VBLENDVPS x,x,x,x is 2 uops, 2c latency, ~1c throughput. (Same for ymm.)
SKL: SSE4 BLENDVPS x,x,xmm0 is 1 uop, 1c latency, ~0.36c throughput in my testing, or maybe 0.333c with breaking dep chains. (IDK how Agner got 1c. Maybe that was an editing mistake, and he copied the 1c from the VEX version.)

[V](P)BLENDV(B|PS|PD) is funny: the SSE versions are 1 uop on SKL, I assume because they only have 3 register operands (including implicit XMM0). But the VEX encoding has 4 operands: 1 output and 3 inputs. I think this is too many for 1 uop to encode, and that's why VBLENDVPS is 2 uops even on Skylake. (The blend-control register is encoded by an imm8 in the VEX version instead of implicit xmm0, but I don't think that's what stops the decoders from making it 1 uop. I think it's simply having 4 total operands.)

On Skylake, the uop(s) for [V]BLENDVPS/D and [V]PBLENDVB can run on any of p015 (instead of only p5 on BDW and earlier), but the 2-uop VEX version is still 2 cycle latency. The VEX version has a bias towards port 5, but less than half the total uops run on p5, so it's not p015 + p5. The SSE version seems equally distributed to all of p015.

On SKL, the optimal choice might be to use the SSE encoding, if we can deal with a destructive destination and having the blend control in xmm0. The SSE/AVX penalty on SKL is output dependencies for write-only SSE instructions (like movaps or cvtps2dq) writing to an XMM register that has a dirty upper 128. It's a per-register thing, not like Haswell where it triggers a slow state change. (https://stackoverflow.com/questions/41303780/why-is-this-sse-code-6-times-slower-without-vzeroupper-on-skylake)

---

Footnote: VBLENDVPS throughput is only 1c for a big block of it back-to-back, even though it's only 2 uops that can run on any of 3 ports. So why isn't it 0.66c throughput?

VBLENDVPS throughput (for back-to-back vblendvps) seems to be limited by some front-end effect. In an unrolled loop with 20 vblendvps (with no loop-carried dependencies), there is a negligible number of cycles where the front-end delivered the full 4 uops. In most cycles only 2 are issued.

This is not a general problem for 2-uop instructions or anything: 9x bextr + dec/jnz = 19 uops total runs at 5.00c / iter, or 3.8 uops / clock, with the only cycle not to issue 4 uops being (I think) the group of 3 including the loop branch. Playing around with other 2-uop instructions, I didn't see front-end bottlenecks. I saw some back-end bottlenecks because other 2-uop instructions aren't so nicely distributed over ports, but perf counts for idq_uops_not_delivered.cycles_fe_was_ok:u generally equaled total cycles. (It counts when either the FE delivers 4 uops, or the back end was stalled and thus it's not the front-end's fault.)

A 1-uop instruction following a vblendvps can issue with it in the same cycle, so this effect is probably not horrible for normal cases where we're using vblendvps mixed with normal instructions. I haven't investigated further whether this is a front-end effect (uop cache fetch problem?) or whether it's an allocation bottleneck.
Possibly being a 4-operand instruction has something to do with it, although I don't think each individual uop can have that many operands.
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #5 from Peter Cordes ---

     IF ( xij.GT.+HALf ) xij = xij - PBCx
     IF ( xij.LT.-HALf ) xij = xij + PBCx

For code like this, *if we can prove only one of the IF() conditions will be true*, we can implement it more efficiently, I think, by checking the magnitude of xij to see if a SUB is needed, and if so figuring out the sign to apply to PBCx.

    if(abs(xij) > HALF) { xij -= PBCx XOR sign_bit( xij ) }

    # xij in xmm0
    # PBCx in xmm7
    # HALF in xmm6
    # set1( -0.0f ) in xmm5   (i.e. 1U<<31, a sign-bit mask)
    vandnps   %xmm5, %xmm0, %xmm1    # abs(xij)
    vcmpltps  %xmm1, %xmm6, %xmm1    # HALF < abs(xij)
    vandps    %xmm5, %xmm0, %xmm2    # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)
    vandps    %xmm2, %xmm1, %xmm1    # +-PBCx, or 0.0 if abs(xij) is between -+HALF
    vsubps    %xmm1, %xmm0, %xmm0    # xij -= PBCx, -PBCx, or 0.0

There's a good amount of ILP here, but the critical path is ANDPS + CMPPS + ANDPS + SUBPS = 10 cycles on Skylake. We might want to use VPAND for some of this on Haswell, to avoid a port 5 bottleneck at least on the critical path. (Skylake runs FP booleans on any port. BDW and earlier restrict them to port 5, where they can't compete with FMA and where bypass latency is always optimal. On SKL they can introduce extra bypass latency if they pick p0 or p1.)

    vandps    %xmm5, %xmm0, %xmm2    # signbit(xij)
    vxorps    %xmm7, %xmm2, %xmm2    # PBCx (xij>=0) or -PBCx (xij<0)

could be replaced with a (v)blendvps using the original xij to select between PBCx and -PBCx. With the SSE encoding, that saves a uop and a cycle of latency (but only off the critical path). And I think it would cost us a vmovaps to set up for it.

---

I think this is better than IF-conversion of both IFs separately, but I haven't really looked. It should be much better for *latency*. But it's only equivalent if subtracting PBCx can't possibly make xij negative and the next IF condition also true.

---

I was looking at a similar case of applying a fixup if the abs value of an input is outside a range in https://stackoverflow.com/questions/54364694/how-to-convert-scalar-code-of-the-double-version-of-vdts-pade-exp-fast-ex-app/54377840#54377840. I don't think I came up with anything there that's not already obvious or covered by the example above, though. Except that if we had needed to square xij at some point, we could have checked xij * xij < HALF*HALF as the bound condition to save the ANDNPS. But then the mulps latency is part of the input to cmpps.
[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #6 from Peter Cordes --- Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization suggestion for doing both compares in one masked SUB of +-PBCx applies equally. And I think my testing with VBLENDVPS should apply equally to VBLENDVPD. Since this is `double`, if we're going branchless we should definitely be vectorizing for a pair of doubles, like doing xij = X0(1,i) - X0(1,j) and yij = X0(2,i) - X0(2,j) together with a vmovupd, and a vector of PBCx, PBCy. Even if we later need both x and y separately (if those FMAs in the asm are multiplying components of one vector), we might still come out ahead from doing the expensive input processing with PD, then it's only one `vunpckhpd` to get the Y element ready, and that can run in parallel with any x * z stuff Or if we can unroll by 3 SIMD vectors over contiguous memory, we can get {X0,Y0} {Z0,X1} {Y1,Z1}. We get twice the work for a cost of only 3 extra unpacks, doing 2 i and j values at once. If this was 3 floats, using a SIMD load would be tricky (maybe vmaskmovps if we need to avoid going off the end), unless we again unroll by 3 = LCM(vec_len, width)
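Tying comments #5 and #6 together, here is an intrinsics sketch of the branchless fixup for a pair of doubles (function and parameter names are mine, not from the benchmark):

    #include <immintrin.h>

    // xij -= copysign(PBC, xij) wherever |xij| > HALF; otherwise leave xij alone.
    // Same dataflow as the asm in comment #5, but with double per comment #6.
    __m128d wrap_pbc(__m128d xij, __m128d pbc, __m128d half)
    {
        const __m128d signmask = _mm_set1_pd(-0.0);       // just the sign bits
        __m128d absx  = _mm_andnot_pd(signmask, xij);     // |xij|
        __m128d need  = _mm_cmpgt_pd(absx, half);         // all-ones where |xij| > HALF
        __m128d spbc  = _mm_xor_pd(_mm_and_pd(signmask, xij), pbc);  // +-PBC with xij's sign
        __m128d delta = _mm_and_pd(need, spbc);           // +-PBC, or 0.0 where no fixup needed
        return _mm_sub_pd(xij, delta);
    }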
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #15 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently don't emit XOR clear in front of these instrucitons, when they
> operate with memory input.

They *do* have an output dependency. It might or might not actually be a problem, and be worth clogging the front-end with extra uops to avoid, depending on surrounding code. >.<

e.g. ROUNDSD: DEST[127:64] remains unchanged. Thanks, Intel. You'd think by SSE4.1 they would have learned that false dependencies suck, and that it's extremely rare to actually take advantage of this merge behaviour, but no.

For register-source ROUNDSD / ROUNDSS, we can use ROUNDPD / ROUNDPS, which write the full destination register and have identical performance on all CPUs that support them. (Except Silvermont, where roundps/pd have 5c latency vs. 4c for roundss/sd. Goldmont makes them equal.) KNL has faster (V)ROUNDPS/D than ROUNDSS/SD, maybe only because of the SSE encoding? Agner Fog isn't clear, and doesn't have an entry that would match vroundss/sd.

Copy-and-round is good for avoiding extra MOVAPS instructions, which can make SSE code front-end bound and reduce the effective size of the out-of-order window.

Preserving FP exception semantics for packed instead of scalar register-source:

* If the upper element(s) of the source is/are known 0, we can always do this with sqrt, round, and convert: they won't produce any FP exceptions, not even inexact. (But not rsqrt / rcpps, of course.) This will be the case after a scalar load, so if we need the original value in memory *and* the result of one of these instructions, we're all set.

* With rounding, the immediate can control masking of precision exceptions, but not Invalid, which is always raised by SRC = SNaN. If we can rule out SNaN in the upper elements of the input, we can use ROUNDPS / ROUNDPD.

roundps/d can't produce a denormal output. I don't think denormal inputs slow it down on any CPUs, but it's worth checking for cases where we don't care about preserving exception semantics and want to use it with potentially-arbitrary garbage in high elements.

rsqrtps can't produce a denormal output because sqrt makes the output closer to 1.0 (reducing the magnitude of the exponent). (And thus neither can sqrtps.) SQRTPS/PD is the same performance as SQRTSS/SD on new CPUs, but old CPUs that crack 128-bit ops into two 64-bit halves are slower: Pentium III, Pentium M, and Bobcat. And Jaguar for sqrt. Also Silvermont is *MUCH* slower for SQRTPD/PS than SD/SS, and even Goldmont Plus has slower packed SQRT, RSQRT, and RCP than scalar.

But RCPPS can produce a denormal: (double)1.0/FLT_MAX = 2.938736e-39, which is smaller than FLT_MIN = 1.175494e-38.

So according to Agner's tables:

* ROUNDPS/PD is never slower than ROUNDSS/SD on any CPU that supports them.

* SQRTPS/PD *are* slower than scalar on Silvermont through Goldmont Plus, and Bobcat, Nano 3000, and P4 Prescott/Nocona. By about a factor of 2, enough that we should probably care about it for tune=generic. For ss/ps only (not double), K10 and Jaguar also have slower sqrtps than ss. Also in 32-bit mode, P4, Pentium M and earlier Intel, and Atom, are much slower for packed than scalar sqrt. SQRTPD is *faster* than SQRTSD on KNL. (But hopefully we're never tuning for KNL without AVX available.)
* RSQRT / RCP: packed is slower on Atom, Silvermont, and Goldmont (multi-uop so a big decode stall). Somewhat slower on Goldmont Plus (1 uop but half throughput). Also slower on Nano3000, and slightly slower on Pentium 4 (before and after Prescott/Nocona), and KNL. (But hopefully KNL can always use VRSQRT28PS/PD or scalar) Pentium M and older again decode as at least 2 uops for packed, same as Bobcat and K8. Same performance for packed vs. scalar on Jaguar, K10, bdver1-4, ryzen, Core2 and later, and SnB-family. * CVTSS2SD vs. PD, and SD2SS vs. PD2PS packed is slower on k8, bdver1-4 (scalar avoids the shuffle uop), Nano3000, KNL. On Silvermont by just 1 cycle latency (so even a MOVAPS on the critical path would make it equal.) Similar on Atom. Slower on CPUs that do 128-bit vectors as two 64-bit uops, like Bobcat, and Pentium M / K8 and older. packed is *faster* on K10, Goldmont/GDM Plus (same latency, 1c vs. 2c throughput), Prescott, P4. Much faster on Jaguar (1c vs. 8c throughput, and 1 uop vs. 2). same speed (but without the false dep) for SnB-family (mostly), Core 2, Ryzen. Odd stuff: Agner reports: Nehalem: ps2pd = 2 uops / 2c, ss2sd = 1 uop / 1c. (I guess just zero-padding the significand, no rounding required). pd2ps and sd2ss are equal at 2 uops / 4c latency. SnB: cvtpd2ps is 1c higher latency than sd2ss. IvB: ps2pd on IvB is 1c vs. 2c for ss2sd On HSW and later things have settled down to e
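As a source-level illustration of the copy-and-round case mentioned above, here is a minimal sketch of my own (not from this report), assuming SSE4.1 and a value that comes from a scalar load so the upper elements are known zero; the compiler is of course free to apply the same idea internally to register-source cases:

#include <immintrin.h>

// Round a float from memory using the packed instruction: ROUNDPS writes the
// whole destination register, so there is no false output dependency, and the
// zeroed upper elements (from the movss-style load) round exactly, raising
// nothing that ROUNDSS wouldn't.
float round_loaded(const float *p)
{
    __m128 v = _mm_load_ss(p);                      // upper 3 elements zeroed
    v = _mm_round_ps(v, _MM_FROUND_TO_NEAREST_INT); // roundps, not roundss
    return _mm_cvtss_f32(v);
}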
[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85366 Bug ID: 85366 Summary: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; } Product: gcc Version: 8.0.1 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* From https://codereview.stackexchange.com/questions/191792/find-prime-factors-in-c/191801#191801, simplified to use a pointer instead of returning std::vector. Interestingly, the version with std::vector can be more easily coaxed to use both results of one idiv, see the Godbolt link. void find_prime_factors_ptr(int n, int *p) { // inefficient to test even numbers > 2, but that's a separate missed optimization. for (int i = 2; i <= n; i++) { while (n % i == 0) { *p++ = i; n /= i; // reordering the loop body doesn't help } } } https://godbolt.org/g/ogyZW8 g++ 8.0.1 20180411 -O3 -march=haswell gives us this inner loop: ... # outer loop movl%edi, %eax # idiv to test if inner loop should even run once, leaving n/i in eax .L4: movl%edi, %eax# but instead we discard it addq$4, %rsi movl%ecx, -4(%rsi) cltd idivl %ecx cltd # then modulo that division result to see if the next iteration should run movl%eax, %edi idivl %ecx # leaves n/i in eax, ready for next iteration... testl %edx, %edx je .L4 ... So both ways to get to .L4 (fall in or loop) have n/i in EAX from an idiv already! The loop doesn't need to be re-structured to take advantage, gcc just needs to keep track of what it's doing. ## Hand optimized version of the whole function: cmpl$1, %edi jle .L9 movl$2, %ecx .L5: movl%edi, %eax cltd idivl %ecx # eax = tmp = n/i testl %edx, %edx jne .L3 .L4: movl%ecx, (%rsi) addq$4, %rsi # we're tuning for Haswell, no register-read stalls so increment after reading and save a byte in the addressing mode movl%eax, %edi# n = tmp cltd idivl %ecx # eax = tmp = n/i testl %edx, %edx je .L4 .L3: incl%ecx cmpl%edi, %ecx jle .L5 .L9: ret I didn't make *any* changes to the code outside the inner loop. I ended up just removing movl %edi, %eax / cltd / idiv %ecx. Changing the inner loop to int tmp; while (tmp = n/i, n % i == 0) { *p++ = i; n = tmp; } gives us the asm almost that good (an extra mov inside the loop), but we get a jmp into the loop instead of peeling the while condition from before the first iteration: # gcc8.0.1 -O3 -march=haswell output, commented but unmodified find_prime_factors_ptr_opt(int, int*): cmpl$1, %edi jle .L18 movl$2, %ecx jmp .L19 .L16: # top of inner loop addq$4, %rsi movl%ecx, -4(%rsi) movl%eax, %edi# extra mov puts this and the next mov on the critical path .L19:# inner loop entry point movl%edi, %eax cltd idivl %ecx testl %edx, %edx je .L16 # bottom of inner incl%ecx cmpl%edi, %ecx jle .L19 # bottom of outer .L18: ret Saving code-size here with the dependent chain of movl %eax, %edi / movl %edi, %eax is pretty minor even on CPUs like original Sandybridge, or Bulldozer, without mov-elimination, because idiv's latency dominates. 
But it could easily be taken out of the inner loop by duplicating it outside the outer loop, then moving it to the outer-only part of the loop body, like this: cmpl$1, %edi jle .L18 movl$2, %ecx movl%edi, %eax # eax = n added here jmp .L19 .L16: # top of inner loop addq$4, %rsi movl%ecx, -4(%rsi) movl%eax, %edi # n = tmp still here .L19:# inner loop entry point #movl%edi, %eax # eax = n removed from here in inner/outer loop cltd idivl %ecx testl %edx, %edx je .L16 # bottom of inner movl%edi, %eax# eax = n also added here, in the outer-only part incl%ecx cmpl%edi, %ecx jle .L19 #
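For reference, the source-level form of that transformation, assembled from the snippets quoted in this report (the name matches the label in the asm listing above); this is what the hand-edited asm corresponds to: keep the quotient from the same division that produced the remainder, so the next iteration doesn't redo it.

void find_prime_factors_ptr_opt(int n, int *p) {
    for (int i = 2; i <= n; i++) {
        int tmp;
        // one idiv per iteration: reuse the quotient when the remainder is 0
        while (tmp = n / i, n % i == 0) {
            *p++ = i;
            n = tmp;
        }
    }
}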
[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1 from Peter Cordes --- This LEA stuff is part of what gcc does to align the stack by 32 for spilling AVX locals. Gcc's stack-align sequence is over-complicated and ties up an extra register for the whole function (add volatile to the local and see the -O3 code). Or at least it was; it seems gcc8 trunk just makes a stack frame with EBP / RBP but references 32-byte aligned locals from aligned RSP instead of unaligned RBP. It used to copy the address of the return address to make a full copy of ret-addr / saved-RBP for the aligned stack frame, which was super weird. https://godbolt.org/g/RLJNtd. (With an alloca or something, gcc8 does the same crazy stack-frame stuff as gcc7; otherwise it's much cleaner, like clang.) The actual bug here is that it's not fully optimized away when it turns out that no 32-byte spills / reloads from locals are left in the function. gcc for x86-64 sometimes has a few leftover instructions like that in more complex functions using __m256; this is not exclusively an i386 problem, but it happens more easily for 32-bit, it seems.
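A hypothetical reduced example of my own (not the original bug's test case) of the kind of function involved: a 32-byte local forced into memory needs a 32-byte-aligned stack slot, and the question is whether the realignment/frame setup gets cleaned up when no such spill actually survives:

#include <immintrin.h>

// 'volatile' keeps the 32-byte local in memory, so the function needs a
// 32-byte-aligned stack slot and therefore some stack-alignment code.
__m256 keep_spill(__m256 a, __m256 b)
{
    volatile __m256 tmp = _mm256_add_ps(a, b);   // spilled to an aligned slot
    __m256 t = tmp;                              // reload
    return _mm256_mul_ps(t, b);
}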
[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #23 from Peter Cordes --- Just to recap the current situation (gcc/g++ 8.0.1 20180425): I ported David Marillat's testcase to work as C or C++ https://godbolt.org/g/QdG2V6. (And changed it to set global variables instead of calling printf, so you can see the results from looking at the asm output instead of running it). C++11 alignof() now agrees with C11 alignof() (which didn't change) that alignof(int64_t) is 4 when targeting the i386 System V ABI. Previously G++'s alignof() reported 8, while gcc's C11 alignof (stdalign.h) reported 4. That was the only change: struct-member alignof results are unchanged, and already matched between C11 and C++11. 4 is the minimum alignment that *any* int64_t, or pointer to int64_t, is assumed to have when generating code for i386 SysV. gcc / g++ are allowed to generate code that breaks if passed a pointer to int64_t that wasn't 4-byte aligned. (Auto-vectorization is one case where that can happen on x86: https://stackoverflow.com/q/47510783/224132). They're *not* allowed to assume that it's 8-byte aligned unless they can see the definition and know that a particular int64_t object is over-aligned, e.g. to its natural alignment of 8, like gcc chooses to do whenever possible (i.e. outside structs). So in both C++ and C (and in g++/gcc after this patch), alignof(int64_t) is the minimum that any allocator must give an int64_t for correctness (in this funky 32-bit ABI), not the recommended alignment that gcc and g++ both already used whenever ABI struct-packing rules didn't constrain them. It's also the guaranteed minimum that code can *assume*. e.g. a manually-vectorized library function might check alignof(T) == sizeof(T) before assuming that using 16-byte aligned loads/stores can line up with element boundaries. (An array inside a struct { int foo; int64_t arr[10]; } would violate this for i386 SysV). Anyway, I think use-cases like these are why the standard is worded the way it is, and why it makes sense for alignof() to report the guaranteed/required minimum. The recommended or actual alignment is useful, too, though, for other cases, so it's nice that GNU __alignof() is also available to report that. Semi-related: gcc depends on 8-byte alignment for C11 _Atomic int64_t but still fails to provide it inside structs on the i386 SysV ABI (Bug 65146), using the same alignment rules as regular int64_t. C++11 std::atomic is fine, getting the required natural alignment even on i386 SysV so SSE2 movq is atomic and lock add is efficient. This change to what alignof() reports in C++ had no effect on C at all, or on any alignment choices made by the compiler in either C or C++. I only mention it as another interesting case where i386 SysV's under-alignment of 64-bit types requiring special care, but that one will require an ABI change of some sort to fix.
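A small sketch of the distinction (my own example; the commented values are what the discussion above says to expect when targeting i386 System V, i.e. -m32):

#include <cstdint>
#include <cstddef>

struct S { int a; std::int64_t b; };

// With -m32 (i386 SysV), per the discussion above, I'd expect:
//   alignof(std::int64_t)     == 4  (required minimum; what generated code may assume)
//   __alignof__(std::int64_t) == 8  (GCC's preferred alignment outside structs)
//   offsetof(S, b)            == 4  (struct-member alignment stays 4, no padding)
constexpr std::size_t required  = alignof(std::int64_t);
constexpr std::size_t preferred = __alignof__(std::int64_t);
constexpr std::size_t member    = offsetof(S, b);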
[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 --- Comment #2 from Peter Cordes --- The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least for this testcase. Gcc's stack-alignment strategy seems to be improved overall (not copying the return address when not needed), so probably it's really fixed. It's still present in 7.3.
[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 Bug ID: 85585 Summary: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string table Product: gcc Version: 9.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Bug 84011 shows some really silly code-gen for PIC code and discussion suggested using a table of offsets instead of a table of actual pointers, so you just need one base address. A further optimization is possible when the strings are all similar length, and/or the longest one isn't much longer than a pointer: Pad all strings to the same length with trailing 0 bytes, and calculate a pointer instead of loading it from an array. This removes the possibility of multiple entries sharing the same suffix (which is a missed optimization gcc wasn't already doing), but avoids needing any space for storing pointers in memory at all. In the case discussed in bug 84011 (Linux's phy.h const char *phy_modes(phy_interface_t interface)), the longest strings are 11 bytes (including the \0), and there are 23 of them. So it takes 253 bytes of char data to store everything (not counting the "unknown" for the default: special case) with all strings padded to 11 bytes. The current strings + pointer-table implementation doesn't merge string literals where one string is a suffix of another; this is another a missed-optimization that would save many bytes here. (e.g. instead of .string "mii" and .string "gmii", just have .LC4 .byte 's'; .LC3: .byte 'g'; .LC2: .string "mii".) That optimization plus byte or 16-bit offsets into the table would be nice and compact, and most CPUs have efficient zero-extending narrow loads. So for cases where the other optimization I'm suggesting isn't good, that would probably be best. The current packed string-data takes 158 bytes , so with 4-byte offsets it takes 158+23*4 = 250 bytes. Or with 8-byte pointers/offsets, it takes 158 + 23*8 = 342 bytes. Or with 1-byte offsets, 158 + 23*1 = 181 bytes: load with movzbl. (If you can't use the offset directly as an 8-byte memory source operand for ADD to a pointer, there's no point making it 32 bits instead of 8.) The code for *using* such a table is quite simple. 
This C source compiles to what I'm suggesting: https://godbolt.org/g/E8J3iS struct foo { char str[11]; } const table[23] = {}; const char *lookup(unsigned long idx) { if(idx > 23) { return "unknown"; //idx=23; } return table[idx].str; } Multiply by 11 only takes 2 LEA instructions on x86, so for PIC code with a RIP-relative LEA we end up with 4 ALU instructions total to get a string address, after checking the if condition: # gcc7.3 -march=haswell -O3 -fPIE output: https://godbolt.org/g/qMzaY8 leaq.LC0(%rip), %rax# "unknown" cmpq$23, %rdi ja .L4 # branchless is also an option leaq(%rdi,%rdi,4), %rax leaqtable(%rip), %rdx # RIP-relative table base address leaq(%rdi,%rax,2), %rax addq%rdx, %rax # table + 11*idx .L4: ret This is even better in no-PIE mode where a static address is usable as a signed 32-bit immediate: lookup(unsigned long): movl$.LC0, %eax cmpq$23, %rdi ja .L4 leaq(%rdi,%rdi,4), %rax leaqtable(%rdi,%rax,2), %rax# 3 cycle latency for 3-component LEA on SnB-family .L4: ret So this has extremely low code-size cost on x86-64, for the benefit of removing a table load in the dependency chain from enum to string data. It does cost significant data size vs. a byte-offset table with suffix-merging, but it's better than what gcc is doing now in non-PIE (table of qword pointers), and *much* better in PIE (insane jump table). - The byte-index version is equivalent to transforming the C source like this: const char packedstrings[158] = {}; const unsigned char offsets[23] = {}; const char *lookup_byteidx(unsigned long idx) { if(idx>23) return "unknown"; return &packedstrings[offsets[idx]]; } leaq.LC0(%rip), %rax # "unknown" cmpq$23, %rdi ja .L9 leaqoffsets(%rip), %rax leaqpackedstrings(%rip), %rdx movzbl (%rax,%rdi), %eax addq%rdx, %rax .L9: ret We can save an instruction here by making the relativ
[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 --- Comment #1 from Peter Cordes --- By comparison, the no-PIE table of pointers only needs one instruction: movqCSWTCH.4(,%rdi,8), %rax So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a massive savings in data size. clang -fPIE compiles the plain switch to the obvious / sane 2 instruction sequence which should be our baseline for normal cases. # clang6.0 -fPIE -O3 (switch compilers on the Godbolt link) leaq.Lswitch.table.phy_modes(%rip), %rcx movq(%rcx,%rax,8), %rax Clang is willing to make a table that needs relocations for the entries. (My suggestions all avoid that because they're based on offsets, not a table of pointers. Avoiding rodata relocations that dirty a page and prevent sharing has some non-zero value, although it's low on many architectures where memory is cheap.)
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #9 from Peter Cordes --- (In reply to rguent...@suse.de from comment #4) > An optimization would be to > add an indirection by, say, only recording the constant offset > into an "array of strings" in the table, thus effectively > > "case1\0case2\0..."[CSWITCH[i]] > > which would require only a relocation to access the single string > constant. But it would prohibit cases of string merging within > those strings unless we implement that as well for this optimization. gcc already totally misses optimizations here where one string is a suffix of another. "mii" could just be a pointer to the 3rd byte of "sgmii", but we instead duplicate all the characters. That's where major savings are possible for this function. > Note this might be profitable unconditionally, not just with -fpie/pic > as the CSWITCH table would be smaller (dependent on the total > size of the merged string). Indeed, I wrote up bug 85585 with ideas for optimizing this. A table of byte or uint16_t offsets into a static buffer of packed strings looks good for PIC and for position-dependent. To avoid any runtime relocations, all you need is the ability to get a static address into a register (e.g. RIP-relative LEA) and do an indexed load relative to it, just like using a normal static char[]. Then add the load result to that address. Runtime relocation is nice to avoid even if you don't *need* to avoid it. Also possible is padding each string out to a constant length and calculating an index into that, removing a level of indirection. (Good when strings are similar length and/or all short, and there aren't many strings that are duplicates or suffixes of others.) Again you just need to get a static address into a register, and add it to 11*enum_value. This is all ADD + LEA (with one of them being RIP-relative).
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #12 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > (In reply to Peter Cordes from comment #9) > > gcc already totally misses optimizations here where one string is a suffix > > of another. "mii" could just be a pointer to the 3rd byte of "sgmii", but > > we instead duplicate all the characters. That's where major savings are > > possible for this function. > > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Oops, right I was only looking at gcc's asm output, didn't check an actual linked binary. Will the linker currently catch a case like this? .LC_base: .LC2: .string "mii" .LC3: .string "gmii" table: .byte .LC2 - .LC_base, .LC3 - .LC_base and drop .string "mii" entirely + rewrite the table to .byte .LC3+1 - .LC_base, .LC3 - .LC_base (This discussion should probably be happening on bug 85585.) Sorry I don't know the actual mechanism by which gcc signals to the linker that it can / can't merge. I guess only in some sections? Because gcc couldn't allow it if was emitting an array like this, where dropping a string would change the offsets for later data and break offset calculations: const struct { char str[11]; } table[] = { {"mii"}, {"gmii"} };
[Bug tree-optimization/84011] Optimize switch table with run-time relocation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #13 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Because gcc would benefit from knowing if merging makes the total block of strings for a switch() table short enough to use a uint8_t offset[] instead of uint16_t. If we don't know at compile time, we'd have to be conservative and potentially use a wider offset table. (Although as Joseph points out https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585#c2, without more linker support for this we could end up missing out on literal merging across compilation units. So perhaps a first step in applying this idea would be to use 32-bit offsets from the start of the .rodata.str1.1 section, so we can still let the linker merge strings and end up with them non-contiguous without having to force the one that gets kept to be the one that's part of our block of strings.)
[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #5 from Peter Cordes --- Update: https://godbolt.org/g/ZQDY1G gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.

void rangecheck_var(int64_t x, int64_t lim2) {
    //lim2 >>= 60;
    lim2 &= 0xf;  // let the compiler figure out the limited range of limit
    if (x>=0 && x<lim2) ext();
}

void rangecheck_const(int64_t x) {
    if (x>=0 && x<=(INT_MAX-1)) ext();
}
// clang and gcc use 2 branches
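The transformation itself, written out in source form (my sketch, not code from the bug), shows why a single unsigned compare is enough once the limit is known non-negative:

#include <stdint.h>

extern void ext();

void rangecheck_by_hand(int64_t x, int64_t lim2)
{
    lim2 &= 0xf;   // lim2 is now known to be in 0..15
    // x >= 0 && x < lim2  is equivalent to one unsigned compare:
    // a negative x wraps to a huge unsigned value and fails the test,
    // exactly as it would have failed x >= 0.
    if ((uint64_t)x < (uint64_t)lim2)
        ext();
}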
[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #14 from Peter Cordes --- I happened to look at this old bug again recently. re: extracting high the low two 32-bit elements: (In reply to Uroš Bizjak from comment #11) > > Or without SSE4 -mtune=sandybridge (anything that excluded Nehalem and other > > CPUs where an FP shuffle has bypass delay between integer ops) > > > > movd %xmm0, %eax > > movshdup %xmm0, %xmm0 # saves 1B of code-size vs. psrldq, I think. > > movd %xmm0, %edx > > > > Or without SSE3, > > > > movd %xmm0, %eax > > psrldq $4, %xmm0# 1 m-op cheaper than pshufd on K8 > > movd %xmm0, %edx > > The above two proposals are not suitable for generic moves. We should not > clobber input value, and we are not allowed to use temporary. SSE3 movshdup broadcasts the high element within each pair of 32-bit elements so movshdup %xmm0, %xmm1 movd %xmm1, %eax saves a byte of code vs pshufd / movd, and saves a uop on Merom and avoids a flt->int. (According to Agner Fog's tables, pshufd is flt->int domain, i.e. it wants input in the float domain. While movshdup ironically is only an integer shuffle.) Probably not worth looking for that optimization, though, because it's not worth using universally (Nehalem has worse latency for float shuffles between int instructions). With just SSE2, PSHUFLW is the same size as PSHUFD and faster on Merom / K8 (slowshuffle CPUs where PSHUFD is multiple uops). It's not slower on any current CPUs. I could imagine some future CPU having better throughput for 32-bit element size shuffles than 16-bit, though. That's already the case for wider lane-crossing shuffles (VPERMW YMM is multiple uops on Skylake-AVX512). This would be a definite win for tune=core2 or k8, and Pentium M, but those are so old it's probably not worth adding extra code to look for it. I think it's pretty future-proof, though, unless Intel or AMD add an extra shuffle unit for element sizes of 32-bit or wider on another port.
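In intrinsics form, the alternatives discussed above look roughly like this (an illustrative sketch of mine, not code from the bug; which shuffle the compiler should pick is exactly the tuning question being discussed):

#include <immintrin.h>
#include <stdint.h>

// Extract the low two 32-bit elements of v into integer registers.
void extract_low_two(__m128i v, int32_t *lo, int32_t *hi)
{
    *lo = _mm_cvtsi128_si32(v);                              // movd
    // SSE3: movshdup copies the high element of each 32-bit pair downward
    __m128i t = _mm_castps_si128(_mm_movehdup_ps(_mm_castsi128_ps(v)));
    *hi = _mm_cvtsi128_si32(t);                              // movd
    // SSE2-only alternatives for the shuffle: _mm_srli_si128(v, 4) (psrldq),
    // _mm_shuffle_epi32(v, 1) (pshufd), or a pshuflw copy-and-shuffle.
}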
[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #5 from Peter Cordes --- AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop replacement for vpinsrq/d, which is 2 uops on Intel/AMD. See my answer on https://stackoverflow.com/questions/50779309/loading-an-xmm-from-gp-regs.

I don't have access to real hardware, but according to reported uop counts, this should be very good: 1 uop per instruction on Skylake-avx512 or KNL.

    vmovq        xmm0, rax                      1 uop p5, 2c latency
    vpbroadcastq xmm0{k1}, rdx   ; k1 = 0b0010  1 uop p5, 3c latency
    vpbroadcastq ymm0{k2}, rdi   ; k2 = 0b0100  1 uop p5, 3c latency
    vpbroadcastq ymm0{k3}, rsi   ; k3 = 0b1000  1 uop p5, 3c latency

xmm vs. ymm vs. zmm makes no difference to latency, according to InstLatx64. (For a full ZMM vector, maybe start a 2nd dep chain and vinsert to combine 256-bit halves. Also means only 3 k registers instead of 7.)

    vpbroadcastq zmm0{k4}, rcx   ; k4 = 0b10000   3c latency
    ... filling up the ZMM reg

Starting with k1 = 2 = 0b0010, we can init the rest with KSHIFT:

    mov      eax, 0b0010    ; = 2
    kmovw    k1, eax
    KSHIFTLW k2, k1, 1
    KSHIFTLW k3, k1, 2
    # KSHIFTLW k4, k1, 3
    ...

KSHIFT runs only on port 5 (SKX), but so does KMOV; moving from integer registers would just cost extra instructions to set up integer regs first.

It's actually ok if the upper bytes of the vector are filled with broadcasts, not zeros, so we could use 0b1110 / 0b1100 etc. for the masks.

We could start with kxnor to generate a -1 and left-shift that, but that's 2 port5 uops vs. mov eax,2 / kmovw k1, eax being p0156 + p5.

Loading k registers from memory is not helpful: according to IACA, it costs 3 uops. (But that includes p237, and a store-AGU uop makes no sense, so it might be wrong.)
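The same idea in intrinsics form, as a sketch assuming AVX-512VL and x86-64 (_mm256_mask_set1_epi64 is the merge-masking broadcast from a GPR; the compiler may of course pick different instructions for this source):

#include <immintrin.h>

// Build {a, b, c, d} in a YMM register from four GPRs using merge-masked
// broadcasts (1 uop each on SKX per the table above) instead of 2-uop vpinsrq.
__m256i set_epi64_from_gprs(long long a, long long b, long long c, long long d)
{
    __m256i v = _mm256_castsi128_si256(_mm_cvtsi64_si128(a));  // vmovq
    v = _mm256_mask_set1_epi64(v, 0x2, b);   // element 1
    v = _mm256_mask_set1_epi64(v, 0x4, c);   // element 2
    v = _mm256_mask_set1_epi64(v, 0x8, d);   // element 3
    return v;
}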
[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352 Bug ID: 86352 Summary: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos Product: gcc Version: 9.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

The wrong-code bug 86314 also revealed some very weird code-gen decisions, which the fix didn't improve. (I think the lock bts peephole is seen pretty late, and that's one necessary factor for this problem. But even without it, an unnecessary data dependency between the lock bts loop and clearing memory is silly.)

This ended up being about 5 separate bugs, but IDK which belong together or are already reported:

* useless mov %rsi, %rcx and useless mov %rdx, %rdi
* using setc/movzx instead of xor %eax,%eax to get a constant 0; slower and creating a data dependency
* Doing that inside the loop instead of after
* Not adjusting register allocation to allow xor / set-flags / setc
* rep stos vs. vector stores as a zeroing strategy vs. any other repeated value.

The reproducer test-case for bug 86314 loops until it finds and claims a zero bit in a uint64_t, then returns a Bucket() object (with a constructor that zero-initializes it) with no data dependency on anything. But gcc decides to introduce a flag -> integer 0/1 inside the acquire() loop instead of just using xor eax,eax before rep stosq. The loop can only exit when CF = 0, so RAX = 0, so it's not a correctness problem.

The loop is branching on CF as set by BTS, so there's no need to have the 0/1 in a register at all inside the loop, and setc/movzx from a known-zero CF is more expensive than xor-zeroing. (Plus it gives the STOSQ a data dependency on the LOCK BTS flag result which it wouldn't have otherwise. The stores can't commit until after the lock memory barrier, but they can execute.)

This is the actual code-gen from (GCC-Explorer-Build) 9.0.0 20180627 https://godbolt.org/g/XGF5tR

BucketMap::acquireBucket():
        movq    %rdi, %rdx
        movq    %rsi, %rcx        # useless, lock bts can use (%rsi)
.L2:
        movq    (%rsi), %rax
        andl    $1, %eax          # source is simplified to only check positions 0 or 1
        lock btsq %rax, (%rcx)    # Why not (%rsi)?
        setc    %al
        movzbl  %al, %eax         # xor / bts / setc would have been possible with a different reg
        jc      .L2
        # rax = 0 because the loop can only exit when CF=0
        # should use xor %eax,%eax here instead
        movq    %rdx, %rdi        # Useless, RDI still == RDX
        movl    $16, %ecx
        rep stosq
        movq    %rdx, %rax        # can't be done before rep stosq: RAX needs to be 0
        ret

With -m32, where 64-bit lock bts isn't available, we have lock cmpxchg8b ending with an OR. So there is a zero in an integer register from that, but it's not in EAX, so the code gen includes an extra `mov %esi, %eax`, which is not cheaper than xor %eax,%eax, especially with -march=haswell. Sandybridge-family has xor-zeroing as cheap as a NOP, but mov-elimination isn't always perfect, and SnB itself doesn't have it. And of course mov still has a data dependency on the source of the zero, so it defeats the effect of branch prediction + speculation breaking (control) dependencies. This last applies on any out-of-order x86.

I guess the lock bts peephole is seen too late to notice that it can't recycle the 0 from the loop condition anymore, and ends up generating code to materialize it. But why inside the loop?

--
Even if we *did* need an integer 0/1 in a register inside the loop, we could still use the xor / set-flags / setcc optimization: simply use a register other than RAX for the load / AND $1 / bts source. And you can hoist the xor-zeroing out of the loop.

        xor     %eax, %eax
.L2:
        movq    (%rsi), %rcx
        andl    $1, %ecx
        lock btsq %rcx, (%rsi)
        setc    %al               # use %rax
        jc      .L2

---
Separately: If the initializer is non-zero, it uses SSE or AVX stores. That makes no sense either: if rep stosq is optimal, use mov eax, 1 for the all-ones case. (See the ifdef in the Godbolt link to try it.) If it's not optimal, use xorps xmm0,xmm0 to create an all-zero vector.

I guess gcc is checking for all-zeros as a common special case, but doesn't check for repeats of any other value, except for repeated bytes recognized as memset. So it makes sense that gcc uses a different strategy, but I think for only 16x 8 bytes (128 bytes) vector stores beat rep stos on current CPUs. (That may change when IceLake introduces fast
[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3 from Peter Cordes --- (In reply to Martin Liška from comment #2)
> Switch conversion bails out because it knowns that a jump table (or a bit
> test can) be used for this snippet. Then we prefer to use a jump table then
> a bit test. With -fno-jump-tables we generate the same code.
> That said, I confirm it's a small limitation.

This regression appeared in GCC9 for this test-case, and is present in GCC9.1 on Godbolt: https://godbolt.org/z/fDjTxN

bool is_vowel(char c) {
    switch (c) {
        case 'a': case 'e': case 'i': case 'o': case 'u': case 'y':
            return 1;
        default:
            return 0;
    }
}

But simplifying it to just those 3 cases (case 'a': case 'e': case 'i':) gets gcc9 and trunk to use an immediate bitmap. With gcc8 and earlier, the x86-64 asm for the 2 versions is identical except for the immediate used with TEST EAX, imm32.

(And BTW, there's a missed optimization here of using mask & (1<<n) instead of (mask>>n) & 1. Or better, looking for that conversion in user source code / logic, because people often write tests that way, requiring the creation of an actual 1 in a register. Or for ISAs with flags, have the mask already right-shifted by 1 so the bit shifted out is the one we want. Then CF = result with no extra test.)

Also an x86 missed optimization: BT reg,reg is very efficient (single uop) on Intel and Ryzen, and avoids needing a 3-uop-on-Intel shift-by-CL or a mov reg,1.

I'll report these ideas separately if/when I get around to it.
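To make the bit-test discussion concrete, here are the two source-level forms in question (my own illustrative helpers, not code from the bug); the second form needs no 1 materialized in a register and maps naturally onto BT:

// 26-entry bitmap of the vowels, bit (c - 'a') set for a,e,i,o,u,y
static const unsigned vowel_mask =
    (1u << ('a'-'a')) | (1u << ('e'-'a')) | (1u << ('i'-'a')) |
    (1u << ('o'-'a')) | (1u << ('u'-'a')) | (1u << ('y'-'a'));

bool vowel_mask_and_shifted_one(char c) {   // mask & (1<<n): materialize a 1, shift by CL
    unsigned n = (unsigned)c - 'a';
    return n < 26 && (vowel_mask & (1u << n));
}

bool vowel_shift_mask_right(char c) {       // (mask>>n) & 1: shift the mask instead; ~ BT
    unsigned n = (unsigned)c - 'a';
    return n < 26 && ((vowel_mask >> n) & 1);
}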
[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- EAD neglected to link previous discussion about this in the initial bug report. https://stackoverflow.com/a/57377890/224132 points out that the SysV ABI wording is > If the type has class MEMORY, then **the caller provides space** for the > return value and passes the address of this storage in %rdi We can argue semantics, but in my answer on the same question, I argued that the implication is that that space won't alias any other space. (Because the return-value object exists in the C abstract machine, so the default assumption should be that it exists for real in the calling convention.) Whether it's practical to look for this optimization or not, I'm still curious about the point that @M.M made about the semantics of restrict https://stackoverflow.com/questions/57377314/what-prevents-the-usage-of-a-function-argument-as-hidden-pointer/57436765#comment101288442_57403379 Does the callee do_something() reading a global count as happening inside the block scope of use(Vec3 *restrict out) { ... }? The ISO C standard wording talks about reaching the end of a block, which hasn't happened even though `out` is not in scope inside the other function. If so, then calling use(&global) creates UB when *out = do_something(); executes because it writes the pointed-to memory via a restrict-pointer in the same block where it reads it from a pointer that's not derived from out. If so, restrict would make this optimization safe if we can prove that do_something is "noexcept" and doesn't longjmp.
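For context, the shape of the example being discussed is roughly this (reconstructed from the names mentioned above and the linked SO thread, not verbatim):

struct Vec3 { float x, y, z; };

Vec3 do_something();                  // may read a global the caller can also name

void use(Vec3 *__restrict out) {
    *out = do_something();            // writes *out in the same block where
}                                     // do_something() might read the same object

Vec3 global;
void caller() { use(&global); }       // the question: may use() pass 'out' straight
                                      // through as do_something()'s hidden return slot?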
[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1 from Peter Cordes --- The real missed optimization is that GCC is returning its own incoming arg instead of returning the copy of it that create() will return in RAX. This is what blocks tailcall optimization; it doesn't "trust" the callee to return what it's passing as RDI. See https://stackoverflow.com/a/57597039/224132 for my analysis (the OP asked the same thing on SO before reporting this, but forgot to link it in the bug report.) The RAX return value tends to rarely be used, but probably it should be; it's less likely to have just been reloaded recently. RAX is more likely to be ready sooner than R12 for out-of-order exec. Either reloaded earlier (still in the callee somewhere if it's complex and/or non-leaf) or never spilled/reloaded. So we're not even gaining a benefit from saving/restoring R12 to hold our incoming RDI. Thus it's not worth the extra cost (in code-size and instructions executed), IMO. Trust the callee to return the pointer in RAX.
[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- Since some code is apparently still avoiding this because of old broken GCC (e.g. https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193 ) Perhaps a workaround of _mm512_castsi512_si256 would be useful? Or does that ICE as well? I can't repro the bug on Godbolt so IDK. Doing _mm512_set1_epi8(c) and a separate _mm256_set1_epi8(c) doesn't CSE with GCC, only clang. https://godbolt.org/z/uZ4lv- And if you leave out -march=skylake-avx512 you get even worse asm from GCC.
[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080 Bug ID: 92080 Summary: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c) Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

As a workaround for PR 82887 some code (e.g. a memset) uses

    __m512i zmm = _mm512_set1_epi8((char)c);
    __m256i ymm = _mm256_set1_epi8((char)c);

instead of ymm = _mm512_castsi512_si256(zmm); (found in the persistent-memory library https://github.com/pmem/pmdk/blob/a6031710f7c102c6b8b6b19dc9708a3b7d43e87b/src/libpmem/x86_64/memset/memset_nt_avx512f.h#L193 )

Obviously we'd like to CSE that instead of actually broadcasting twice.

MVCE:

#include <immintrin.h>
__m512i sinkz;
__m256i sinky;
void foo(char c) {
    sinkz = _mm512_set1_epi8(c);
    sinky = _mm256_set1_epi8(c);
}

https://godbolt.org/z/CeXhi8 g++ (Compiler-Explorer-Build) 10.0.0 20191012

# g++ -O3 -march=skylake-avx512   (AVX512BW + AVX512VL are the relevant ones)
foo(char):
        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vpbroadcastb    %edi, %ymm0           # wasted insn
        vmovdqa64       %ymm0, sinky(%rip)    # wasted EVEX prefix
        vzeroupper
        ret

Without AVX512VL it wastes even more instructions (vmovd + AVX2 vpbroadcastb xmm,ymm), even though AVX512BW vpbroadcastb zmm does set the YMM register. (There are no CPUs with AVX512BW but not AVX512VL; if people compile that way it's their own fault. But this might be relevant for set1_epi32() on KNL.)

Clang finds this optimization, and uses a shorter vmovdqa for the YMM store, saving another 2 bytes of code size:

        vpbroadcastb    %edi, %zmm0
        vmovdqa64       %zmm0, sinkz(%rip)
        vmovdqa         %ymm0, sinky(%rip)
        vzeroupper
        ret
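For comparison, the cast-based source (the workaround named at the top) doesn't rely on CSE at all; a minimal sketch mirroring the MVCE's globals:

#include <immintrin.h>

__m512i sinkz;
__m256i sinky;

// Derive the YMM value from the ZMM broadcast with a cast (no instruction),
// so only one vpbroadcastb is ever written in the source.
void foo_cast(char c) {
    __m512i z = _mm512_set1_epi8(c);
    sinkz = z;
    sinky = _mm512_castsi512_si256(z);   // just reuse the low 256 bits
}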
[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 --- Comment #5 from Peter Cordes --- Reported bug 92080 for the missed CSE
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #5 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > Can't reproduce. It is true that we now emit the __atomic_load_16 call, but > that was intentional change Yup. >, and it can't be easily tail call, because the > tailcall pass doesn't understand that the low 8 bytes of the 16 byte > structure are returned the same as the whole structure Ok that's disappointing, but hopefully is very rare after inlining. > But I certainly can't reproduce any significant value masking etc., tried > r235002 (+- gcc 6 branchpoint), r247000 (+- gcc 7 branchpoint) as well as > current trunk. > Unless it is something that has been broken on the 7 branch and later fixed. > > But have just tried gcc 7.1.0 release and can't reproduce even there. I can't repro it locally with gcc7.1.1 either. This is the version info from -fverbose-asm on the godbolt.org link (which does still repro it) # GNU C++11 (GCC-Explorer-Build) version 7.1.0 (x86_64-linux-gnu) # compiled by GNU C version 5.4.0 20160609, GMP version 6.1.0, MPFR version 3.1.4, MPC version 1.0.3, isl version isl-0.16.1-GMP It's not present in the gcc7.2 build on Godbolt.org either. I asked Matt Godbolt what exact version the compiler explorer site is using for the gcc7.1.0 dropdown (https://github.com/mattgodbolt/compiler-explorer/issues/684). Hopefully he can help us track down a gcc SVN revision to repro it, or confirm that it was a misconfigured or buggy gcc or something. Just to rule out the possibility of a now-dormant bug lurking somewhere.
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #6 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > But have just tried gcc 7.1.0 release and can't reproduce even there. Matt says the Compiler Explorer backend uses upstream release tarballs like `URL=ftp://ftp.gnu.org/gnu/gcc/gcc-${VERSION}/${TARBALL}`. (where TARBALL is `gcc-${VERSION}.tar.xz` for recent gcc where .xz is available). The compiler config used is https://github.com/mattgodbolt/compiler-explorer-image/blob/master/gcc/build/build.sh#L78: CONFIG="" CONFIG+=" --build=x86_64-linux-gnu" CONFIG+=" --host=x86_64-linux-gnu" CONFIG+=" --target=x86_64-linux-gnu" CONFIG+=" --disable-bootstrap" CONFIG+=" --enable-multiarch" CONFIG+=" --with-abi=m64" CONFIG+=" --with-multilib-list=m32,m64,mx32" CONFIG+=" --enable-multilib" CONFIG+=" --enable-clocale=gnu" CONFIG+=" --enable-languages=c,c++,fortran" # used to have go, but is incompatible with m32/mx32 CONFIG+=" --enable-ld=yes" CONFIG+=" --enable-gold=yes" CONFIG+=" --enable-libstdcxx-debug" CONFIG+=" --enable-libstdcxx-time=yes" CONFIG+=" --enable-linker-build-id" CONFIG+=" --enable-lto" CONFIG+=" --enable-plugins" CONFIG+=" --enable-threads=posix" CONFIG+=" --with-pkgversion=GCC-Explorer-Build" BINUTILS_VERSION=2.29.1 Does that help figure out how to build a gcc7.1.0 that can repro this?
[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 80846, which changed state. Bug 80846 Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED |---
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Peter Cordes changed: What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED |--- --- Comment #21 from Peter Cordes --- (In reply to Richard Biener from comment #20)
> Fixed.

Unfortunately only fixed for integer, not FP. The OpenMP and vanilla float array sum functions from the godbolt link in the initial bug report still use 256b shuffles, including a gratuitous vperm2f128 when the upper half isn't used, so vextractf128 would have done the same job in 1 uop on Ryzen instead of 8.

Even on Intel CPUs, they're optimized for code-size, not performance (vhaddps instead of shuffle / vaddps). Remember that Intel CPUs with AVX only have one FP shuffle unit. (Including Sandy/Ivybridge, which has 2 integer-128 shuffle units.)

float sumfloat_autovec(const float arr[]) {
    arr = __builtin_assume_aligned(arr, 64);
    float sum=0;
    for (int i=0 ; i<1024 ; i++)
        sum = sum + arr[i];
    return sum;
}

# gcc 20180113 -mavx2 -ffast-math -O3
# (tune=generic, and even with arch=znver1 no-prefer-avx128)
        ...
        vhaddps     %ymm0, %ymm0, %ymm0
        vhaddps     %ymm0, %ymm0, %ymm1
        vperm2f128  $1, %ymm1, %ymm1, %ymm0   # why not vextract?
        vaddps      %ymm1, %ymm0, %ymm0       # gratuitous 256b
        vzeroupper

This bug is still present for FP code: it narrows from 256b to scalar only in the last step.

Every VHADDPS is 2 shuffles + 1 add on Intel. They're in-lane shuffles, but it's still 2 uops for port5 vs. VSHUFPS + VADDPS. (Costing an extra cycle of latency because with only 1 shuffle port, the 2 interleave-shuffles that feed a vertical-add uop can't run in the same cycle.) (V)HADDPS with the same input twice is almost never the best choice for performance.

On Ryzen it's an even bigger penalty: HADDPS xmm is 4 uops (vs. 3 on Intel). It's also 7c latency (vs. 3 for ADDPS). 256b VHADDPS ymm is 8 uops, one per 3 cycle throughput, and Agner Fog reports that it's "mixed domain", i.e. some kind of penalty for ivec / fp domain crossing. I guess the shuffles it uses internally are ivec domain?

With multiple threads on the same core, or even with ILP with surrounding code, uop throughput matters as well as latency, so more uops is worse even if it didn't have latency costs.

The sequence I'd recommend (for both Intel and AMD) is: (See also http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86/35270026#35270026)

        vextractf128  $1, %ymm0, %xmm1
        vaddps        %xmm1, %xmm0, %xmm0    # narrow to 128b
        vmovshdup     %xmm0, %xmm1           # copy high->low in each pair
        vaddps        %xmm1, %xmm0, %xmm0
        vmovhlps      %xmm0, %xmm0, %xmm1    # duplicate high 64b
        vaddps        %xmm1, %xmm0, %xmm0

The MOVSHDUP / MOVHLPS sequence is also what you want without VEX, so you can do a 128b hsum with 4 instructions, with no MOVAPS.

Intel: 6 uops total, 3 shuffles.  vs. 8 total, 5 shuffles.
AMD Ryzen: 6 uops, 3 shuffles.  vs. 26 total uops, 20 of them shuffles. And much worse latency, too.

Even just fixing this specific bug without fixing the rest of the sequence would help AMD *significantly*, because vextractf128 is very cheap, and vhaddps xmm is only half the uops of ymm. (But the latency still sucks.)

-
Even for integer, this patch didn't fix the MOVDQA + PSRLDQ that we get without AVX. PSHUFD or PSHUFLW to copy+shuffle is cheaper. I guess I need to report that bug separately, because it probably won't get fixed soon: if I understand correctly, there's no mechanism for the back-end to tell the auto-vectorizer what shuffles it can do efficiently!
It usually won't make too much difference, but for very small arrays (like 8 `int` elements) the hsum is a big part of the cost, although it's probably still worth auto-vectorizing *if* you can do it efficiently.
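In intrinsics form, the recommended epilogue looks something like this (my own sketch of the sequence above, not what GCC currently emits; the addition order differs from a strict left-to-right sum, which is fine in the -ffast-math context of this bug):

#include <immintrin.h>

// Horizontal sum of a __m256 using the narrow-first strategy: one 128-bit
// extract, then the SSE3 movshdup / movhlps shuffles, 3 shuffles + 3 adds.
float hsum256_ps(__m256 v)
{
    __m128 lo = _mm256_castps256_ps128(v);
    __m128 hi = _mm256_extractf128_ps(v, 1);   // vextractf128: cheap even on Ryzen
    __m128 x  = _mm_add_ps(lo, hi);            // narrow to 128b first
    __m128 sh = _mm_movehdup_ps(x);            // copy high->low in each pair
    x  = _mm_add_ps(x, sh);
    sh = _mm_movehl_ps(sh, x);                 // high 64b of x -> low 64b
    x  = _mm_add_ss(x, sh);
    return _mm_cvtss_f32(x);
}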
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #22 from Peter Cordes --- Forgot the Godbolt link with updated cmdline options: https://godbolt.org/g/FCZAEj.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #25 from Peter Cordes --- We're getting a spill/reload inside the loop with AVX512:

.L2:
        vmovdqa64   (%esp), %zmm3
        vpaddd      (%eax), %zmm3, %zmm2
        addl        $64, %eax
        vmovdqa64   %zmm2, (%esp)
        cmpl        %eax, %edx
        jne         .L2

Loop finishes with the accumulator in memory *and* in ZMM2. The copy in ZMM2 is ignored, and we get

        # narrow to 32 bytes using memory indexing instead of VEXTRACTI32X8 or VEXTRACTI64X4
        vmovdqa     32(%esp), %ymm5
        vpaddd      (%esp), %ymm5, %ymm0
        # braindead: vextracti128 can write a new reg instead of destroying xmm0
        vmovdqa     %xmm0, %xmm1
        vextracti128  $1, %ymm0, %xmm0
        vpaddd      %xmm0, %xmm1, %xmm0

... then a sane 128b hsum as expected, so at least that part went right.
[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #28 from Peter Cordes --- (In reply to Richard Biener from comment #27) > Note that this is deliberately left as-is because the target advertises > (cheap) support for horizontal reduction. The vectorizer simply generates > a single statement for the reduction epilogue: > [...] > so either the target shouldn't tell the vectorizer it supports this or > it simply needs to expand to better code. Which means - can you open > a separate bug for this? Yes; I was incorrectly assuming the inefficient asm had the same cause as before. I agree *this* is fixed, thanks for the explanation of how gcc was arriving at this sequence. I'll have a look at the backend canned sequence defs and see if there are any other sub-optimal ones, or if it was only AVX. Having canned sequences for different target instruction sets instead of leaving it to arch-independent code seems like it should be an improvement over the old design.
[Bug target/38959] Additional switches to disallow processor supplementary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3 from Peter Cordes --- We can maybe close this as fixed (if -march=i386 didn't exist/work at the time) or invalid. Or maybe we want to add some CPU-level awareness to code-gen for __builtin_ia32_rdtsc / rdpmc / rdtscp.

The cmov / fcmov / fcomi proposed switches are already supported as part of -march=pentium -mtune=generic or lower, e.g. -march=i386. (The 32-bit default is something like arch=i686 and tune=generic, with it being possible to configure gcc so SSE2 is on by default in 32-bit code.) Those are the important ones, because they're emitted automatically by the compiler's back-end. The other options would just be trying to save you from yourself, e.g. rejecting source that contains __rdtsc() / __builtin_ia32_rdtsc().

I'm not sure what the situation is with long NOPs. GCC doesn't (normally?) emit them, just using .p2align directives for the assembler. In 32-bit mode, GAS appears to avoid long NOPs, using either 2-byte xchg ax,ax or pseudo-nops like LEA esi,[esi+eiz*1+0x0] that add a cycle of latency to the dep chain involving ESI. Even with -march=haswell, gcc+gas fail to use more efficient long NOPs for padding between functions.

---
I'm not sure if CPUID is ever emitted by gcc's back-end directly, only from inline asm. i386/cpuid.h uses inline asm. But __get_cpuid_max() checks if CPUID is even supported in a 386-compatible way, checking if a bit in EFLAGS is sticky or not. If your source code is written safely, you won't have a problem unless possibly __builtin_cpu_init runs CPUID without checking, in programs that use __builtin_cpu_supports() or _is().

__builtin_ia32_rdpmc() and __rdtsc() do *not* check -march= before emitting rdpmc and rdtsc. Neither does __rdtscp(), which is interesting because that instruction is new enough that some still-relevant CPUs don't support it. __rdpmc() isn't "volatile", though, so stop-start optimizes to 0. (I found this bug looking for existing reports of that issue.)

Test cases: https://godbolt.org/z/hqPdza
FCMOV and CMOV are also handled correctly, but I didn't write functions for them.

int fcomi(double x, double y) { return x<y; }

> Proposed switches:
>
> --nocpuid This option causes the compiler to not generate cpuid opcodes
> --nocmov This option causes the compiler to not generate cmov opcodes
> --nofcmov This option causes the compiler to not generate fcmov opcodes
> --nofcomi This option causes the compiler to not generate fcomi opcodes
> --nonopl This option causes the compiler to not generate fcomi opcodes
> --nordpmc This option causes the compiler to not generate rdpmc opcodes
> --nordtsc This option causes the compiler to not generate rdtsc opcodes
>
> Possibly a general switch that is equivalent to all of the above
>
> --nosupplementaryinstructions
>
> Rationale
>
> It is possible that a developer still wants to compile for a particular
> architecture (for example the i486), but does not wish to generate code with
> supplementary instructions (such as cpuid), that may be present on that
> architecture.
[Bug target/38959] Additional switches to disallow processor supplementary instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 --- Comment #4 from Peter Cordes --- The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous comment is already reported and fixed (in gcc9 only): bug 87550 It was present since at least gcc 5.0 https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/595214
[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 --- Comment #2 from Peter Cordes --- I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for AVX. If register pressure is an issue, using a reg holding an arbitrary constant (instead of xor-zeroed) is a valid option, as this bug points out. So I'm not sure we should close this as a duplicate of those fixed bugs.
[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #22 from Peter Cordes --- Nice, that's exactly the kind of thing I suggested in bug 80571. If this covers * vsqrtss/sd (mem),%merge_into, %xmm * vpcmpeqd%same,%same, %dest# false dep on KNL / Silvermont * vcmptrueps %same,%same, %ymm # splat -1 without AVX2. false dep on all known uarches as well as int->FP conversions, then we could probably close that as fixed by this as well. bug 80571 does suggest that we could look for any cold reg, like a non-zero constant, instead of requiring an xor-zeroed vector, so it might go slightly beyond what this patch does. And looking for known-to-be-ready dead regs from earlier in the same dep chain could certainly be useful for non-AVX code-gen, allowing us to copy-and-sqrt without introducing a dependency on anything that's not already ready. (In reply to h...@gcc.gnu.org from comment #21) > Author: hjl > Date: Fri Feb 22 15:54:08 2019 > New Revision: 269119
[Bug target/88809] do not use rep-scasb for inline strlen/memchr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4 from Peter Cordes --- Yes, rep scasb is abysmal, and gcc -O3's 4-byte-at-a-time scalar loop is not very good either. With 16-byte alignment, (which we have from calloc on x86-64 System V), we can inline a *much* better SSE2 loop. See https://stackoverflow.com/a/55589634/224132 for more details and microbenchmarks; On Skylake it's about 4 to 5x faster than the current 4-byte loop for large strings, 3x faster for short strings. For short strings (strlen=33), it's about 1.5x faster than calling strlen. For very large strings (too big for L2 cache), it's ~1.7x slower than glibc's AVX2 strlen. The lack of VEX encoding for pxor and pmovmskb is just me being lazy; let gcc emit them all with VEX if AVX is enabled. # at this point gcc has `s` in RDX, `i` in ECX pxor %xmm0, %xmm0 # zeroed vector to compare against .p2align 4 .Lstrlen16: # do { #ifdef __AVX__ vpcmpeqb (%rdx), %xmm0, %xmm1 #else movdqa (%rdx), %xmm1 pcmpeqb%xmm0, %xmm1 # xmm1 = -1 where there was a 0 in memory #endif add $16, %rdx # ptr++ pmovmskb %xmm1, %eax # extract high bit of each byte to a 16-bit mask test %eax, %eax jz.Lstrlen16# }while(mask==0); # RDX points at the 16-byte chunk *after* the one containing the terminator # EAX = bit-mask of the 0 bytes, and is known to be non-zero bsf%eax, %eax # EAX = bit-index of the lowest set bit # terminator is at rdx+rax - 16 # movb $'A', -16(%rdx, %rax) // for a microbench that used s[strlen(s)]='A' sub%rbp, %rdx # p -= start lea -16(%rdx, %rax) # p += byte_within_vector - 16 We should actually use REP BSF because that's faster on AMD (tzcnt), and same speed on Intel. Also an inline-asm implementation of it with a microbenchmark adapted from the SO question. (Compile with -DUSE_ASM -DREAD_ONLY to benchmark a fixed length repeatedly) https://godbolt.org/z/9tuVE5 It uses clock() for timing, which I didn't bother updating. I made it possible to run it for lots of iterations for consistent timing. (And so the real work portion dominates the runtime so we can use perf stat to measure it.) If we only have 4-byte alignment, maybe check the first 4B, then do (p+4) & ~7 to either overlap that 4B again or not when we start 8B chunks. But probably it's good to get to 16-byte alignment and do whole SSE2 vectors, because repeating an aligned 16-byte test that overlaps an 8-byte test costs the same as doing another 8-byte test. (Except on CPUs like Bobcat that split 128-bit vectors into 64-bit halves). The extra AND to round down to an alignment boundary is all it takes, plus the code-size cost of peeling 1 iteration each of 4B and 8B before a 16-byte loop. We can use 4B / 8B with movd / movq instead of movdqa. For pmovmskb, we can ignore the compare-true results for the upper 8 bytes by testing the result with `test %al,%al`, or in general with `test $0x0F, %al` to check only the low 4 bits of EAX for the 4-byte case. The scalar bithack version can use BSF instead of CMOV binary search for the byte with a set high bit. That should be a win if we ever wanted to do scalar on some x86 target especially with 8-byte registers, or on AArch64. AArch64 can rbit / clz to emulate bsf and find the position of the first set bit. (Without efficient SIMD compare result -> integer_mask, or efficient SIMD -> integer at all on some ARM / AArch64 chips, SIMD compares for search loops aren't always (ever?) a win. 
IIRC, glibc strlen and memchr don't use vectors on ARM / AArch64, just scalar bithacks.)
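As a rough C version of the inline-able loop above (a sketch assuming a 16-byte-aligned start pointer, as discussed; the hand-written asm and microbenchmark in the SO answer are the reference, not this):

#include <immintrin.h>
#include <stddef.h>

// 16 bytes per iteration; requires s to be 16-byte aligned so movdqa is safe
// and reading past the terminator stays within the same aligned 16-byte chunk.
size_t strlen_sse2_aligned(const char *s)
{
    const __m128i zero = _mm_setzero_si128();
    const char *p = s;
    unsigned mask;
    do {
        __m128i v = _mm_load_si128((const __m128i *)p);          // movdqa
        mask = (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, zero));
        p += 16;
    } while (mask == 0);
    // p is one chunk past the chunk containing the terminator; the lowest set
    // bit of mask is the byte offset of the 0 within that chunk.
    return (size_t)(p - s) - 16 + (unsigned)__builtin_ctz(mask);  // ctz ~ rep bsf / tzcnt
}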
[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 Bug ID: 90568 Summary: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86 Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* cmp/jne is always at least as efficient as xor/jne, and more efficient on CPUs that support macro-fusion of compare and branch. Most support cmp/jne fusion (including all mainstream Intel and AMD, not low-power), but none support xor/jne fusion. void foo() { volatile int buf[4]; buf[1] = 2; } gcc trunk on Godbolt, but same code-gen all the way back to gcc4.9 foo: subq$40, %rsp movq%fs:40, %rax movq%rax, 24(%rsp) xorl%eax, %eax movl$2, 4(%rsp) movq24(%rsp), %rax xorq%fs:40, %rax ## This insn should be CMP jne .L5 addq$40, %rsp ret .L5: call__stack_chk_fail As far as I can tell, the actual XOR result value in RAX is not an input to __stack_chk_fail because gcc sometimes uses a different register. Therefore we don't need it, and can use any other way to check for equality. If we need to avoid "leaking" the canary value in a register, we can use SUB, otherwise CMP is even better and can macro-fuse on more CPUs. Only Sandybridge-family can fuse SUB/JCC. (And yes, it can fuse even with a memory-source and a segment override prefix. SUB %fs:40(%rsp), %rax / JNE is a single uop on Skylake; I checked this with perf counters in an asm loop.) AMD can fuse any TEST or CMP/JCC, but only those instructions (so SUB is as bad as XOR for AMD). See Agner Fog's microarch PDF. Linux test program (NASM) that runs sub (mem), %reg with an FS prefix to prove that it does macro-fuse and stays micro-fused as a single uop: default rel %use smartalign alignmode p6, 64 global _start _start: cookie equ 12345 mov eax, 158 ; __NR_arch_prctl mov edi, 0x1002; ARCH_SET_FS lea rsi, [buf] syscall ; wrfsbase rsi; not enabled by the kernel mov qword [fs: 0x28], cookie mov ebp, 10 align 64 .loop: mov eax, cookie sub rax, [fs: 0x28] jne _start and ecx, edx dec ebp jnz .loop .end: xor edi,edi mov eax,231 ; __NR_exit_group syscall ; sys_exit_group(0) section .bss align 4096 buf:resb 4096 nasm -felf64 branch-fuse-mem.asm && ld -o branch-fuse-mem branch-fuse-mem.o to make a static executable taskset -c 3 perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,branches:u,instructions:u,uops_issued.any:u,uops_executed.thread:u -r2 ./branch-fuse-mem On my i7-6700k Performance counter stats for './branch-fuse-mem' (2 runs): 240.78 msec task-clock:u #0.999 CPUs utilized ( +- 0.23% ) 2 context-switches #0.010 K/sec ( +- 20.00% ) 0 cpu-migrations#0.000 K/sec 3 page-faults #0.012 K/sec 1,000,764,258 cycles:u #4.156 GHz ( +- 0.00% ) 2,000,000,076 branches:u# 8306.384 M/sec ( +- 0.00% ) 6,000,000,088 instructions:u#6.00 insn per cycle ( +- 0.00% ) 4,000,109,615 uops_issued.any:u # 16613.222 M/sec ( +- 0.00% ) 5,000,098,334 uops_executed.thread:u# 20766.367 M/sec ( +- 0.00% ) 0.240935 +- 0.000546 seconds time elapsed ( +- 0.23% ) Note 1.0 billion cycles (1 per iteration), and 4B fused-domain uops_issued.any, i.e. 4 uops per loop iteration. (5 uops *executed* is because one of those front-end uops has a load micro-fused). Changing SUB to CMP has no effect. With SUB changed to XOR, the loop takes 1.25 cycles per iteration, and the front-end issues 5 uops per iteration. Other counters are the same. 
Skylake's pipeline is 4-wide, like all Intel since Core2, so an extra uop for the front-end creates a bottleneck.

--

On Intel pre-Haswell, the decoders will only make at most 1 fusion per decode group, so you may need to make the loop larger to still get fusion. Or use this as the loop branch, e.g. with a 1 in memory:

    sub  rax, [fs: 0x28]
    jnz  .loop

or with a 0 in memory, sub or cmp or xor will all set flags according to the register being non-zero. But sub or xor will introduce an extra cycle of latency on the critical path for the loop counter.
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/hHCVTc Forgot to mention, stack-protector also disables use of the red-zone for no apparent reason, so that's another missed optimization. (Perhaps rarely relevant; probably most functions that get stack protection are big enough that they need more stack, or non-leaf. I sidestepped that with volatile.)
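A minimal sketch of the kind of leaf function this is about (my example, not from the report). Without -fstack-protector, gcc can keep such a small local in the 128-byte red zone below %rsp on x86-64 and needs no sub/add of %rsp; the comment's point is that enabling the stack protector currently gives up that red-zone use as well.

/* hypothetical leaf function; volatile (as in the report's example) forces a stack object */
int red_zone_candidate(int x)
{
    volatile int buf[4];
    buf[1] = x;
    return buf[1];
}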
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #3 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2)
> The xor there is intentional, for security reasons we do not want the stack
> canary to stay in the register afterwards, because then it could be later
> spilled or accessible to some exploit in another way.

Ok, so we can't use CMP, therefore we should use SUB, which (as I showed) does help on Sandybridge-family vs. XOR. x - x = 0 just like x ^ x = 0; otherwise SUB wouldn't set ZF on a match. SUB is not worse than XOR on any other CPUs; there are no CPUs with better XOR throughput than ADD/SUB.

In the canary-mismatch case, leaving attacker_value - key in a register seems no worse than leaving attacker_value ^ key in a register. Either value trivially reveals the canary value to an attacker who knows what they overwrote the stack with, if it does somehow leak. We jump to __stack_chk_fail in that case, not relying on the return address on the stack, so a ROP attack wouldn't be sufficient to leak that value anywhere.
[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #5 from Peter Cordes --- And BTW, this only helps if the SUB and JNE are consecutive, which GCC (correctly) doesn't currently optimize for with XOR. If this sub/jne is different from a normal sub/branch and won't already get optimized for macro-fusion, we may get even more benefit from this change by teaching gcc to keep them adjacent.

GCC currently sometimes splits up the instructions like this:

        xorq    %fs:40, %rdx
        movl    %ebx, %eax
        jne     .L7

from gcc8.3 (but not 9.1 or trunk in this case) on https://godbolt.org/z/nNjQ8u

#include <random>
unsigned int get_random_seed() {
    std::random_device rd;
    return rd();
}

Even with -O3 -march=skylake. That's not wrong because XOR can't macro-fuse, but the point of switching to SUB is that it *can* macro-fuse into a single sub-and-branch uop on Sandybridge-family. So we might need to teach gcc about that.

So when you change this, please make it aware of optimizing for macro-fusion by keeping the sub and jne back to back. Preferably with tune=generic (because Sandybridge-family is fairly widespread and it doesn't hurt on other CPUs), but definitely with -mtune=intel or -mtune=sandybridge or later. Nehalem and earlier can only macro-fuse test/cmp.

The potential downside of putting it adjacent instead of 1 or 2 insns earlier, for uarches that can't macro-fuse SUB/JNE, should be about zero on average. These branches should predict very well, and there are no in-order x86 CPUs still being sold. So it's mostly just going to be variations in fetch/decode that help sometimes and hurt sometimes, like any code alignment change.
[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90582 Bug ID: 90582 Summary: AArch64 stack-protector wastes an instruction on address-generation Product: gcc Version: 8.2.1 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

void protect_me() {
    volatile int buf[2];
    buf[1] = 3;
}

https://godbolt.org/z/xdlr5w AArch64 gcc8.2 -O3 -fstack-protector-strong

protect_me:
        stp     x29, x30, [sp, -32]!
        adrp    x0, __stack_chk_guard
        add     x0, x0, :lo12:__stack_chk_guard   ### this instruction
        mov     x29, sp          # frame pointer even though -fomit-frame-pointer is part of -O3.
                                 # Goes away with explicit -fomit-frame-pointer
        ldr     x1, [x0]         # copy the cookie
        str     x1, [sp, 24]
        mov     x1, 0            # and destroy the reg
        mov     w1, 3            # right before it's already destroyed
        str     w1, [sp, 20]     # buf[1] = 3
        ldr     x1, [sp, 24]     # canary
        ldr     x0, [x0]         # key; destroys the key pointer
        eor     x0, x1, x0
        cbnz    x0, .L5
        ldp     x29, x30, [sp], 32   # FP and LR save/restore (for some reason?)
        ret
.L5:    # can the store of the link register go here, for backtracing?
        bl      __stack_chk_fail

A function that returns a global can embed the low 12 bits of the address into the load instruction. AArch64 instructions are fixed-width, so there's no reason (AFAIK) not to do this.

f:
        adrp    x0, foo
        ldr     w0, [x0, #:lo12:foo]
        ret

I'm not an AArch64 performance expert; it's plausible that zero displacements are worth spending an extra instruction on for addresses that are used twice, but unlikely. So we should be doing

        adrp    x0, __stack_chk_guard
        ldr     x1, [x0, #:lo12:__stack_chk_guard]   # in prologue to copy cookie
        ...
        ldr     x0, [x0, #:lo12:__stack_chk_guard]   # in epilogue to check cookie

This also avoids leaving an exact pointer right to __stack_chk_guard in a register, in case a vulnerable callee or code in the function body can be tricked into dereferencing it and leaking the cookie. (In non-leaf functions, we generate the pointer in a call-preserved register like x19, so yes, it will be floating around in a register for callees.) I'd hate to suggest destroying the pointer when copying to the stack, because that would require another adrp later.

Finding a gadget that has exactly the right offset (the low 12 bits of __stack_chk_guard's address) is a lot less likely than finding an ldr from [x0]. Of course this will introduce a lot of LDR instructions with an #:lo12:__stack_chk_guard offset, but hopefully they won't be part of useful gadgets, because they lead to writing the stack, or to EOR/CBNZ to __stack_chk_fail.

I don't see a way to optimize canary^key == 0 any further, unlike x86-64 PR 90568. I assume EOR / CBNZ is at least as efficient as SUBS / BNE on all/most AArch64 microarchitectures, but someone should check.

-O3 includes -fomit-frame-pointer according to -fverbose-asm, but functions protected with -fstack-protector-strong still get a frame pointer in x29 (costing a MOV x29, sp instruction, and save/restore with STP/LDP along with x30). However, explicitly using -fomit-frame-pointer stops that from happening. Is that a separate bug, or am I missing something?

Without stack-protector, the function is vastly simpler:

protect_me:
        sub     sp, sp, #16
        mov     w0, 3
        str     w0, [sp, 12]
        add     sp, sp, 16
        ret

Does stack-protector really need to spill/reload x29/x30 (FP and LR)? Bouncing the return address through memory seems inefficient, even though branch prediction does hide that latency. Is that just so __stack_chk_fail can backtrace?
Can we move the store of the link register into the __stack_chk_fail branch, off the fast path? Or if we do unconditionally store x30 (the link register), at least don't bother reloading it in a leaf function if register allocation didn't need to clobber it. Unlike x86-64, the return address can't be attacked with buffer overflows if it stays safe in a register the whole function. Obviously my test-case with a volatile array and no inputs at all is making -fstack-protector-strong look dumb by protecting a perfectly safe function. IDK how common it is to have leaf functions with arrays or structs that just use them for some computation on function args or globals and then return, maybe after copying the array b
[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 Bug ID: 91103 Summary: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element Product: gcc Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

GCC9.1 and current trunk aren't good at extracting high elements, whether it's with GNU C native vector syntax, or when auto-vectorizing something that ends with the result in the high element.

Using VALIGND we can get any element with one immediate instruction, but it's better to use AVX2 VPERMPD (immediate) when possible. Or inside loops, VPERMPS (vector), or VPERMT2PS (vector). Or of course vextractf32x4 if possible (element at the bottom of a 128-bit lane). Or with only AVX2 available, VPERMPD (immediate) for high elements in __m256 and __m256d vectors is still a big win.

#include <immintrin.h>
float elem12(__m512 v) { return v[12]; }
float elem15(__m512 v) { return v[15]; }

gcc -Ofast -march=skylake-avx512   https://godbolt.org/z/241r8p

elem15:
        vextractf32x8   ymm0, zmm0, 0x1
        vextractf128    xmm0, ymm0, 0x1    # elem12 ends here, after these 2 insns
        vshufps         xmm0, xmm0, xmm0, 255
        ret             # no vzeroupper, I guess because the caller must have __m512 vars too; recent optimization

But AVX512F has vextractf32x4 to extract a 128-bit lane, which would preclude the need for AVX2 vextractf128. That's what clang does.

Obviously inside a loop it would be *much* better to use a single lane-crossing VPERMPS to also avoid the shufps. Intel Skylake easily bottlenecks on shuffle throughput. We'd need a 15 in an XMM register as a control vector, but loading it would be off the latency critical path. (If we needed the scalar zero-extended instead of garbage in high elements, we could VPERMI2PS or VPERMT2PS with a zeroed vector and a shuffle control.)

---

If the element we want is an even element in the low 256 bits, we can get it with a VPERMPD-immediate. GCC does this:

elem6(float __vector(16)):   # GCC 10 trunk
        vextractf128    xmm0, ymm0, 0x1
        vunpckhps       xmm0, xmm0, xmm0
        ret

Instead it should be AVX2

        vpermpd ymm0, ymm0, 3

This bug also applies to __m256, not just __m512.

https://www.felixcloutier.com/x86/vpermpd VPERMPD is a 64-bit granularity lane-crossing shuffle. The AVX512F immediate version reuses the immediate for another 256-bit wide shuffle in the upper half; only the vector-control version can bring an element from the top half of a ZMM down to the bottom. But if we're going to use a vector control, we might as well use VPERMPS. For the integer version of this bug, use VPERMQ.

--

But we can do even better by using an integer VALIGND (AVX512F) shuffle on FP data. There unfortunately isn't an FP flavour of VALIGND, just integer. AFAIK, Skylake-AVX512 still has no bypass-delay penalty for integer shuffles between FP math instructions, i.e. the shuffle unit is connected to both FP and integer forwarding networks. Intel's optimization manual for Skylake (client) has a bypass-latency table that shows 0 extra latency cycles for SHUF/5/1,3 reading from anything, or anything reading from it.

https://www.felixcloutier.com/x86/valignd:valignq It's a 4 or 8-byte granularity version of palignr, except that it's lane-crossing, so the 256 and 512-bit versions are actually useful. The immediate shift count can thus bring *any* element down to the bottom. (Using the same input twice makes it a rotate.)
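A minimal intrinsics sketch of that single-shuffle extract (my own code, under the assumption that one VALIGND does the whole job): rotate element 15 down to element 0, then take the low scalar.

#include <immintrin.h>

float extract15_valignd(__m512 v)
{
    __m512i vi  = _mm512_castps_si512(v);               /* reinterpret, no instruction */
    __m512i rot = _mm512_alignr_epi32(vi, vi, 15);      /* valignd zmm, zmm, zmm, 15 */
    return _mm512_cvtss_f32(_mm512_castsi512_ps(rot));  /* low element */
}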
VALIGND is good on Knight's Landing, too: unlike most 2-input shuffles, it has 1 per clock throughput. For *any* compile-time-constant index, we can always compile v[i] to this:

extract15:
        valignd zmm0, zmm0, zmm0, 15    # I think this is right.
        ret

The only downside I'm aware of is that some future AVX512 CPU might not run VALIGND as efficiently as SKX and KNL.

For vector elements narrower than 32 bits, we may need 2 shuffles even if we consider using a shuffle-control vector. On Skylake-AVX512, AVX512BW vpermw will get the job done, but costs 2 shuffle uops. On CannonLake (and presumably other future Intel), it and AVX512VBMI vpermb are only 1 uop, so it's definitely worth creating a shuffle-control vector if it can be reused.

Also worth considering instead of 2 shuffles: an *unaligned* spill / reload like ICC does for GNU C native vector indexing. Store-forwarding latency is only 6 or 7 cycles I think, and it avoids any port 5 pressure. Not generally a good choice IMO when we can get the job done in one shuffle, but worth considering if we need multiple elements. If the function doe
[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #4 from Peter Cordes --- We should not put any stock in what ICC does for GNU C native vector indexing. I think it doesn't know how to optimize that, because it *always* spills/reloads, even for `vec[0]` which could be a no-op. And it's always a full-width spill (ZMM), not just the low XMM/YMM part that contains the desired element. I mainly mentioned ICC in my initial post to suggest the store/reload strategy in general as an *option*.

ICC also doesn't optimize intrinsics: it pretty much always faithfully transliterates them to asm. e.g. v = _mm_add_epi32(v, _mm_set1_epi32(1)); twice compiles to two separate paddd instructions, instead of one with a constant of set1(2). If we want to see ICC's strided-store strategy, we'd need to write some pure C that auto-vectorizes.

That said, store/reload is certainly a valid option when we want all the elements, and gets *more* attractive with wider vectors, where the one extra store amortizes over more elements.

Strided stores will typically bottleneck on cache/memory bandwidth unless the destination lines are already hot in L1d. But if there's other work in the loop, we care about OoO exec of that work with the stores, so uop throughput could be a factor.

If we're tuning for Intel Haswell/Skylake with 1 per clock shuffles but 2 loads + 1 store per clock throughput (if we avoid indexed addressing modes for stores), then it's very attractive and unlikely to be a bottleneck. There are typically spare load execution-unit cycles in a loop that's also doing stores + other work. You need every other uop to be (or include) a load to bottleneck on that at 4 uops per clock, unless you have indexed stores (which can't run on the simple store-AGU on port 7 and need to run on port 2/3, taking a cycle from a load). Cache-split loads do get replayed to grab the 2nd half, so it costs extra execution-unit pressure as well as extra cache-read cycles.

Intel says Ice Lake will have 2 load + 2 store pipes, and a 2nd shuffle unit. A mixed strategy there might be interesting: extract the high 256 bits to memory with vextractf32x8 and reload it, but shuffle the low 128/256 bits. That strategy might be good on earlier CPUs, too, at least with movss + extractps stores from the low XMM where we can do that directly.

AMD before Ryzen 2 has only 2 AGUs, so only 2 memory ops per clock, up to one of which can be a store. It's definitely worth considering extracting the high 128-bit half of a YMM and using movss then shuffles like vextractps: 2 uops on Ryzen or AMD.

-

If the stride is small enough (so more than 1 element fits in a vector), we should consider shuffle + vmaskmovps masked stores, or with AVX512 then AVX512 masked stores. But for larger strides, AVX512 scatter may get better in the future. It's currently (SKX) 43 uops for VSCATTERDPS or ...DD ZMM, so not very friendly to surrounding code. It sustains one per 17 clock throughput, slightly worse than 1 element stored per clock cycle. Same throughput on KNL, but only 4 uops, so it can overlap much better with surrounding code.

For qword elements, we have efficient stores of the high or low half of an XMM. A MOVHPS store doesn't need a shuffle uop on most Intel CPUs. So we only need 1 (YMM) or 3 (ZMM) shuffles to get each of the high 128-bit lanes down to an XMM register. Unfortunately on Ryzen, MOVHPS [mem], xmm costs a shuffle+store. But Ryzen has shuffle EUs on multiple ports.
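A sketch of the store/reload alternative discussed above (my code, not ICC's output): spill the whole vector and reload one element. No shuffle-port uops, at the cost of store-forwarding latency, so it's mainly interesting when several elements are needed.

#include <immintrin.h>

float extract_via_mem(__m512 v, unsigned i)
{
    float tmp[16] __attribute__((aligned(64)));   /* assumes GCC/Clang attribute syntax */
    _mm512_store_ps(tmp, v);                      /* one full-width store */
    return tmp[i & 15];                           /* reload just the wanted element */
}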
[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #3 from Peter Cordes --- I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more cases.) https://godbolt.org/g/2dfkNV Loads are folding into the shifts now, unlike with gcc7.3. (But they can't micro-fuse because of the indexed addressing mode. A pointer increment might save 1 front-end uop even in the non-unrolled loop) The separate integer loop counter is gone, replaced with a compare against an end-index. But we're still doing 2x vpmovwb + vinserti64x4 instead of vpackuswb + vpermq. Fewer instructions and (more importantly) 1/3 the shuffle uops. GCC knows how to do this for the 256-bit version, so it's apparently a failure of the cost-model that it doesn't for the 512-bit version. (Maybe requiring a shuffle-control vector instead of immediate puts it off? Or maybe it's counting the cost of the useless vpand instructions for the pack / permq option, even though they're not part of the shuffle-throughput bottleneck?) We do use vpackuswb + vpermq for 256-bit, but we have redundant AND instructions with set1_epi16(0x00FF) after a right shift already leaves the high byte zero. --- Even if vmovdqu8 is not slower, it's larger than AVX vmovdqu. GCC should be using the VEX encoding of an instruction whenever it does exactly the same thing. At least we didn't use vpandd or vpandq EVEX instructions. (I haven't found any confirmation about vmovdqu8 costing an extra ALU uop as a store with no masking. Hopefully it's efficient.)
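A sketch (my own intrinsics, not from the comment) of the vpackuswb + vpermq strategy the 512-bit version should be using: pack two vectors of 16-bit elements that are already <= 255 (e.g. after the right shift mentioned above) down to one vector of bytes, with a single in-lane pack plus one qword permute to undo the lane interleaving.

#include <immintrin.h>

__m512i pack_words_to_bytes(__m512i lo_words, __m512i hi_words)
{
    __m512i packed  = _mm512_packus_epi16(lo_words, hi_words);       /* vpackuswb zmm: packs within 128-bit lanes */
    const __m512i fix = _mm512_setr_epi64(0, 2, 4, 6, 1, 3, 5, 7);   /* gather the lane halves back into order */
    return _mm512_permutexvar_epi64(fix, packed);                    /* vpermq zmm, zmm, zmm */
}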
[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #4 from Peter Cordes --- The VPAND instructions in the 256-bit version are a missed-optimization.
[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #17 from Peter Cordes --- (In reply to Jonathan Wakely from comment #16)
> But what we do care about is comment 2, i.e. _Atomic(T) and std::atomic
> should have the same alignment (both in and out of structs). Maybe that needs
> the C front-end to change how _Atomic works, or maybe it needs the C++
> library to change how std::atomic works, but I want to keep this bug open
> while comment 2 gives different answers for C and C++.

Right, gcc's C _Atomic ABI is still broken for long long on 32-bit x86. It only aligns _Atomic long long to 32 bits (inside structs), but then assumes that 8-byte loads / stores (with x87 or SSE1/2) are atomic. It also leads to abysmal performance for LOCK CMPXCHG or other RMW operations if the atomic object is split across a cache line.

That's bug 65146, so we can close this one. (I never got around to posting in the google group for the ABI. By far the best solution is giving _Atomic long long (and other 8-byte objects) a boost to their _Alignof, up to 8-byte alignment even inside structs.)
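A quick way to see the C side of the ABI issue (my test harness, not from the bug). Built with -m32, the comment says _Atomic long long currently gets only 4-byte alignment inside structs, which is what makes 8-byte loads/stores and LOCK CMPXCHG potentially split across a cache line.

#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct wrap { char c; _Atomic long long v; };

int main(void)
{
    /* print what the compiler actually chose; compare -m32 C vs. std::atomic in C++ */
    printf("_Alignof(_Atomic long long) = %zu\n", _Alignof(_Atomic long long));
    printf("offsetof(struct wrap, v)    = %zu\n", offsetof(struct wrap, v));
    return 0;
}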
[Bug target/85038] New: x32: unnecessary address-size prefix when a pointer register is already zero-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038 Bug ID: 85038 Summary: x32: unnecessary address-size prefix when a pointer register is already zero-extended Product: gcc Version: 8.0.1 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

Bug 82267 was fixed for RSP only (or interpreted narrowly as only being about RSP vs. ESP). This bug is about the general case of using address-size prefixes in cases where we could prove they're not needed: either because out-of-bounds is UB so we don't care about wrapping vs. going outside 4GiB, or (simpler) the single-register case when we know the pointer is already zero-extended.

Maybe we want separate bugs to track parts of this that can be fixed with separate patches, but I won't consider this fixed until -mx32 emits optimal code for all the cases listed here. I realize this won't be any time soon, but it's still code-size (and thus indirectly performance) that gcc is leaving on the table.

Being smarter about using 64-bit address-size is even more useful for AArch64 -mabi=ilp32, because it doesn't have 32-bit address-size overrides, so it always costs an extra instruction every time we fail to prove that 64-bit is safe. (And AArch64 ILP32 may get more use than x32 these days.) I intended this bug to be about x32, though.

Useless 0x67 address-size override prefixes hurt code-size and thus performance on everything, with more serious problems on some CPUs that have trouble with more than 3 prefixes (especially Silvermont). See Bug 82267 for the details, which I won't repeat. We still have tons of useless 0x67 prefixes in the default -maddress-mode=short mode (for every memory operand other than RSP, or RIP-relative), and -maddress-mode=long has lots of missed optimizations resulting in wasted LEA instructions, so neither one is good.

float doublederef(float **p){
    return **p;
}
// https://godbolt.org/g/exb74t
// gcc 8.0.1 (trunk) -O3 -mx32 -march=haswell -maddress-mode=short
        movl    (%edi), %eax
        vmovss  (%eax), %xmm0    # could/should be (%rax)
        ret

-maddress-mode=long gets that right, using (%rax), and also (%rdi) because the ABI doc specifies that x32 passes pointers zero-extended. mode=short still ensures that, so failure to take advantage is still a missed-opt.

Note that clang -mx32 violates that ABI guarantee by compiling pass_arg(unsigned long long ptr) { ext_func((void*)ptr); } to just a tailcall (while gcc does zero-extend). See output in the godbolt link above. IDK if we care about being bug-compatible with clang for that corner case for this rare ABI, though. A less contrived case would be a struct arg or return value packed into a register and passed on as just a pointer.

-

// arr+offset*4 is strictly within the low 32 bits because of range limits
float safe_offset(float *arr, unsigned offset){
    unsigned tmp = (unsigned)arr;
    arr = (void*)(tmp & -4096);    // round down to a page
    offset &= 0xf;
    return arr[offset];
}
// on the above godbolt link

# mode=short
        andl    $-4096, %edi
        andl    $15, %esi
        vmovss  (%edi,%esi,4), %xmm0
        # (%rdi,%rsi,4) would have been safe, but that's maybe not worth looking for.
        # most cases have less pointer alignment than offset range

# mode=long
        andl    $-4096, %edi
        andl    $15, %esi
        leal    (%rdi,%rsi,4), %eax
        vmovss  (%eax), %xmm0    # 32-bit addrmode after using a separate LEA

So mode=long is just braindead here.
It gets the worst of both worlds, using a separate LEA but then not taking advantage of the zero-extended pointer. The only way this could be worse would be if the LEA operand-size was 64-bit.

Without the masking, both modes just use vmovss (%edi,%esi,4), %xmm0, but the extra operations defeat mode=long's attempts to recognize this case, and it picks an LEA instead of (or as well as?!?) an address-size prefix.

---

With a 64-bit offset, and a pointer that's definitely zero-extended to 64 bits:

// same for signed or unsigned
float ptr_and_offset_zext(float **p, unsigned long long offset){
    float *arr = *p;
    return arr[offset];
}

# mode=short
        movl    (%edi), %eax            # mode=long uses (%rdi) here
        vmovss  (%eax,%esi,4), %xmm0    # but still 32-bit here.
        ret

Why are we using address-size prefixes to stop a base+index from going outside 4G on out-of-bounds UB? (%rax,%rsi,4) should work for a signed / unsigned 64-bit offset when the pointer is known to be zero-extended. ISO C11 says that pointer+integer produces a result of pointer type, with UB if the result goes
[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038 --- Comment #1 from Peter Cordes --- Correction for AArch64: it supports addressing modes with a 64-bit base register + 32-bit index register with zero or sign extension for the 32-bit index. But not 32-bit base registers. As a hack that's better than nothing, AArch64 could use a 32-bit pointer as the index with a UXTW mode, using a zeroed register as the base (unless indexed modes have any perf downside on real AArch64 chips). But unfortunately, the architectural zero register isn't usable as the base: that encoding means the stack pointer for this instruction. ldr w1,[xzr,w2,uxtw] doesn't assemble, only x0-x30 or SP. http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0801b/BABBGCAC.html http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802b/LDR_reg_gen.html describes LDR Wt, [Xn|SP, Rm{, extend {amount}}] where Rm can be an X or W register, and "extend" can be SXTW or UXTW for word regs, or LSL for X regs. (SXTX is a synonym for LSL). Any of the modes can use a left-shift amount, applied *after* extension to 64-bit. See https://community.arm.com/processors/b/blog/posts/a64-shift-and-extend-operations-operand-modifiers for details on operand-modifiers. gcc6.3 doesn't take advantage with -mabi=ilp32, and Godbolt doesn't have later AArch64 gcc. So gcc will need to know about zero-extended pointers, and the signedness of 32-bit values, to take advantage of AArch64's addressing modes for the common case of a 32-bit index. Teaching gcc to track signed/unsigned in RTL would benefit x32 and AArch64 ILP32, if I understand the situation correctly.
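The common case the comment is about, in plain C (my example): a 64-bit base pointer indexed by a 32-bit unsigned value. AArch64 can fold the extension into the load as something like ldr s0, [x0, w1, uxtw #2] if the compiler knows the index is (or must be) zero-extended.

float load_u32_index(float *arr, unsigned i)
{
    return arr[i];   /* 32-bit index, 64-bit pointer */
}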
[Bug target/69576] New: tailcall could use a conditional branch on x86, but doesn't
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69576 Bug ID: 69576 Summary: tailcall could use a conditional branch on x86, but doesn't Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-*, x86_64-*

In x86, both jmp and jcc can use either a rel8 or rel32 displacement. Unless I'm misunderstanding something, the rel32 displacement in a jcc can be relocated at link time identically to the way the rel32 in a jmp can be.

void ext(void);
void foo(int x) {
    if (x > 10)
        ext();
}

compiles to (gcc 5.3 -O3 -mtune=haswell):

        cmpl    $10, %edi
        jg      .L4
        ret
.L4:
        jmp     ext

Is this a missed optimization, or is there some reason gcc must avoid conditional branches for tail-calls that makes this not a bug? This sequence is clearly better, if it's safe:

        cmpl    $10, %edi
        jg      ext
        ret

If targeting a CPU which statically predicts unknown forward branches as not-taken, and you can statically predict the tail-call as strongly taken, then it could make sense to use clang 3.7.1's sequence:

        cmpl    $11, %edi
        jl      .LBB0_1
        jmp     ext              # TAILCALL
.LBB0_1:
        retq

According to Agner Fog's microarch guide, AMD CPUs use this static prediction strategy, but Pentium M / Core2 assign a BTB entry and use whatever prediction was in that entry already. He doesn't specifically mention static prediction for later Intel CPUs, but they're probably similar. (So using clang's sequence only helps on (some?) AMD CPUs, even if the call to ext() always happens.) AFAICT, gcc's sequence has no advantages in any case.

Note that the code for https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69569 demonstrates this bug as well, but is a separate issue. It's pure coincidence that I noticed this the day after that bug was filed.
[Bug rtl-optimization/69615] New: 0 to limit signed range checks don't always use unsigned compare
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 Bug ID: 69615 Summary: 0 to limit signed range checks don't always use unsigned compare Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

gcc sometimes misses the unsigned-compare trick for checking if a signed value is between 0 and limit (where limit is known to be <= INT_MAX). It seems that gcc fails when the upper limit is a variable, even if I shift or mask it down to a small range. clang handles this case, so I'm sure I constructed my test case in a way that could be optimized.

All the code in this bug report is on godbolt, for ease of trying with older versions of gcc (including for ARM64/ARM/PPC), and with clang / icc13: http://goo.gl/V7PFmv. (I used -x c to compile as C, even though it only provides c++ compilers). This appears to be arch-independent (unless my quick skim of asm for ISAs I barely know misled me...)

The simplest case is when the upper limit is a compile-time constant. There's one case where gcc and clang fail to optimize: x <= (INT_MAX-1), or equivalently, x < INT_MAX.

#include <limits.h>
#include <stdint.h>
extern void ext(void);

// clang and gcc both optimize range checks up to INT_MAX-2 to a single unsigned compare
void r0_to_imax_2(int x){ if (x>=0 && x<=(INT_MAX-2)) ext(); }   // good code
void r0_to_imax_1(int x){ if (x>=0 && x<=(INT_MAX-1)) ext(); }   // bad code
void r0_to_imax  (int x){ if (x>=0 && x<=(INT_MAX-0)) ext(); }   // good code (test/js, not shown)

gcc 5.3.0 -Ofast -mtune=haswell compiles this to:

r0_to_imax_2:
        cmpl    $2147483645, %edi    # that's 0x7ffffffd
        jbe     .L4
        ret
.L4:    jmp     ext

r0_to_imax_1:
        movl    %edi, %eax
        subl    $0, %eax             ## Without any -mtune, uses test edi,edi
        js      .L5
        cmpl    $2147483647, %edi    # that's 0x7fffffff
        je      .L5
        jmp     ext
.L5:    ret

ICC13 compiles this last one to cmp edi, 0x7ffffffe / ja, so unless my mental logic is wrong *and* icc13 is buggy, gcc and clang should still be able to use the same optimization as for smaller upper-limits. They don't: both clang and gcc use two compare-and-branches for r0_to_imax_1.

BTW, the movl %edi, %eax / subl $0, %eax sequence is used instead of the test instruction with -mtune=haswell, and even worse with -march=bdver2, where it even prevents fusion into a compare-and-branch m-op. I'll file a separate bug report for that if anyone wants me to. Agner Fog's microarch guide doesn't mention anything that would give that sequence an advantage over test, unless I'm missing something. It slows AMD down more than (recent) Intel, but that's not what tuning for Haswell means. :P

Now, on to the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller. (If the limit can be negative (or unsigned greater than INT_MAX) the optimization is impossible: INT_MIN and other negative numbers could be "below" the limit.)

// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
    //lim2 >>= 60;
    lim2 &= 0xf;   // let the compiler figure out the limited range of limit
    if (x>=0 && x
[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #3 from Peter Cordes --- @Richard and Jakub: That's just addressing the first part of my report, the problem with x <= (INT_MAX-1), right? You may have missed the second part of the problem, since I probably buried it under too much detail with the first: the case where the limit is variable, but can easily be proven to itself be in the range [0 .. INT_MAX-1) or much smaller:

// gcc always fails to optimize this to an unsigned compare, but clang succeeds
void rangecheck_var(int64_t x, int64_t lim2) {
    //lim2 >>= 60;
    lim2 &= 0xf;   // let the compiler figure out the limited range of limit
    if (x>=0 && x
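Spelled out as plain C, the transform being asked for is the usual one (my restatement, shown with a compile-time-constant limit for simplicity): when 0 <= limit <= INT_MAX, the two signed comparisons collapse to a single unsigned compare, because a negative x converts to an unsigned value greater than any such limit.

extern void ext(void);

void range_check_signed(int x)   { if (x >= 0 && x <= 1000) ext(); }
void range_check_unsigned(int x) { if ((unsigned)x <= 1000u) ext(); }   /* equivalent */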
[Bug target/69622] New: compiler reordering of non-temporal (write-combining) stores produces significant performance hit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69622 Bug ID: 69622 Summary: compiler reordering of non-temporal (write-combining) stores produces significant performance hit Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu, x86_64-linux-gnu

IDK whether to mark this as "target" or something else. Other architectures might have similar write-combining stores that are sensitive to writing whole cache lines at once.

For background, see this SO question: http://stackoverflow.com/questions/25778302/wrong-gcc-generated-assembly-ordering-results-in-performance-hit

In an unrolled copy loop, gcc decides to emit vmovntdq stores in a different order than they appear in the source. There's no correctness issue, but the number of fill buffers is very limited (maybe each core has 10 or so?). So it's *much* better to write all of one cache line, then all of the next cache line. See my answer on that SO question for lots of discussion and links.

The poster of that question got a 33% speedup (from ~10.2M packets per second to ~13.3M packets per second) by putting the loads and stores in source order in the binary. (Unknown hardware and surrounding code, but presumably this loop is *the* bottleneck in his app.) Anyway, real numbers show that this isn't just a theoretical argument that some code would be better.

Compilable test-case that demonstrates the issue:

#include <immintrin.h>
#include <stddef.h>

//#define compiler_writebarrier() __asm__ __volatile__ ("")
#define compiler_writebarrier()   // empty.

void copy_mcve(void *const destination, const void *const source, const size_t bytes)
{
    __m256i *dst = destination;
    const __m256i *src = source;
    const __m256i *dst_endp = (destination + bytes);

    while (dst < dst_endp) {
        __m256i m0 = _mm256_load_si256( src + 0 );
        __m256i m1 = _mm256_load_si256( src + 1 );
        __m256i m2 = _mm256_load_si256( src + 2 );
        __m256i m3 = _mm256_load_si256( src + 3 );
        _mm256_stream_si256( dst+0, m0 );
        compiler_writebarrier();   // even one anywhere in the loop is enough for current gcc
        _mm256_stream_si256( dst+1, m1 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+2, m2 );
        compiler_writebarrier();
        _mm256_stream_si256( dst+3, m3 );
        compiler_writebarrier();
        src += 4;
        dst += 4;
    }
}

compiles (with the barriers defined as a no-op) to (gcc 5.3.0 -O3 -march=haswell: http://goo.gl/CwtpS7):

copy_mcve:
        addq    %rdi, %rdx
        cmpq    %rdx, %rdi
        jnb     .L7
.L5:
        vmovdqa 32(%rsi), %ymm2
        subq    $-128, %rdi
        subq    $-128, %rsi
        vmovdqa -64(%rsi), %ymm1
        vmovdqa -32(%rsi), %ymm0
        vmovdqa -128(%rsi), %ymm3
        # If dst is aligned, the four halves of two cache lines are {A B} {C D}:
        vmovntdq        %ymm2, -96(%rdi)     # B
        vmovntdq        %ymm1, -64(%rdi)     # C
        vmovntdq        %ymm0, -32(%rdi)     # D
        vmovntdq        %ymm3, -128(%rdi)    # A
        cmpq    %rdi, %rdx
        ja      .L5
        vzeroupper
.L7:    ret

If the output buffer is aligned, that B C D A store ordering maximally separates the two halves of the first cache line, giving the most opportunity for partially-full fill buffers to get flushed.

Doing the +32 load first makes no sense with that placement of the pointer-increment instructions. Doing the +0 load first could save a byte of code-size by not needing a displacement byte. I'm guessing that's what one optimizer function was going for when it put the subs there, but then something else came along and re-ordered the loads. Is there something that tries to touch both cache-lines as early as possible, to trigger the loads?
Assuming the buffer is 64B-aligned? Doing the subs after the last store would save another insn byte, because one of the stores could use an empty displacement as well. That's where clang puts the pointer increments (and it keeps the loads and stores in source order). clang also uses vmovaps / vmovntps. It's probably a holdover from saving an insn byte in the non-VEX encoding of the 128b insn, but does make the output work with AVX1 instead of requiring AVX2. Using a 2-register addressing mode for the loads could save a sub instruction inside the loop. Increment dst normally, but reference src with a 2-register addressing mode with dst and a register initialized with src-dst. (In the godbolt link, uncomment the #define ADDRESSING_MODE_HACK. With ugly enough source, gcc can be bludgeoned into making code like that. It wastes insns in the intro, though, a
[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2 from Peter Cordes --- Besides code-size, uop-cache size is a factor for Intel CPUs. imul is only a single uop, while neg/and is 2 uops. Total number of instructions is a factor for other CPUs, too, but only locally. (Saving uop-cache space can mean speedups for *other* code that doesn't get evicted.)

If the operation isn't part of a long dependency chain, imul is a better choice on almost all CPUs. Let OOO execution sort it out. When latency matters some, we have to weigh the tradeoff of code-size / more insns and uops vs. slightly (or much) higher latency.

Agner Fog's instruction tables indicate that 32bit imul is probably ok for tune=generic, but 64bit imul should maybe only be used with -mtune=intel (but absolutely not with tune=atom. Maybe not with tune=silvermont either, but it does have modest OOO capabilities to hide the latency. It's not as wide, so saving insns maybe matters more?). I'm not sure if tune=intel is supposed to put much weight on pre-Silvermont Atom.

From Agner Fog's spreadsheet, updated 2016-Jan09:

                                  uops/m-ops   latency   recip-throughput   execution pipe/port
Intel: SnB-family (Sandybridge through Skylake)
  imul r32,r32:                       1           3             1            p1
  imul r64,r64:                       1           3             1            p1
AMD: bdver1-3
  imul r32,r32:                       1           4             2            EX1
  imul r64,r64:                       1           6             4            EX1
Intel: Silvermont
  imul r32,r32:                       1           3             1            IP0
  imul r64,r64:                       1           5             2            IP0
AMD: bobcat/jaguar
  imul r32,r32:                       1           3             1            I0
  imul r64,r64:                       1           6             4            I0

old HW:
Intel: Nehalem
  imul r32,r32:                       1           3             1            p1
  imul r64,r64:                       1           3             1            p0
Intel: Merom/Penryn (Core2)
  imul r32,r32:                       1           3             1            p1
  imul r64,r64:                       1           5             2            p0   (same as FP mul, maybe borrows its wider multiplier?)
Intel: Atom
  imul r32,r32:                       1           5             2            Alu0,Mul
  imul r64,r64:                       6          13            11            Alu0,Mul
AMD: K8/K10
  imul r32,r32:                       1           3             1            ALU0
  imul r64,r64:                       1           4             2            ALU0_1 (uses units 0 and 1)
VIA: Nano3000
  imul r32,r32:                       1           2             1            I2
  imul r64,r64:                       1           5             2            MA

If gcc keeps track of execution port pressure at all, it should also avoid imul when surrounding code is multiply-heavy (or doing other stuff that also contends for the same resources as imul). I didn't check on neg/and, but I assume every microarchitecture can run them on any port with one cycle latency each.

Getting off topic here: tune=generic should account for popularity of CPUs, right? So I hope it won't sacrifice much speed for SnB-family in order to avoid something that's slow on Pentium4. (e.g. P4 doesn't like inc/dec, but all other CPUs rename the carry flag separately to avoid the false dep. Not a great example, because that only saves a couple code bytes. shrd isn't a good example either, because it's slow even on AMD Bulldozer.)

Is there a tune=no_glass_jaws that *will* give up speed (or code size) for common CPUs in order to avoid things that are *really* bad on some rare microarchitectures (especially old ones)? Or maybe a tune=desktop that doesn't care what's slow on Atom/Jaguar? People distributing binaries that probably won't be used on Atom/Silvermont netbooks might use that. Anyway, I think it would be neat to have the option of making a binary that will be quite good on SnB and not have major problems on recent AMD, where I don't care if it has the occasional slow instruction on Atom or K8. Or alternatively to have a binary that doesn't suck badly anywhere.
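For reference, the two code-gen strategies being compared, written out in C (my illustration; the bug is about which one the x86 back-end should pick): a bool multiplier can go through imul directly, or be turned into a mask with neg + and.

int mul_via_imul(int x, _Bool b)   { return x * b; }
int mul_via_negand(int x, _Bool b) { return x & -(int)b; }   /* -(0) = 0, -(1) = all-ones mask */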
[Bug middle-end/51837] Use of result from 64*64->128 bit multiply via __uint128_t not optimized
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1 from Peter Cordes --- This bug isn't limited to __uint128. There seems to be a general problem with full-multiplies (at least on x86). I see similar excess mov instructions (or saving/restoring a register that is not touched) with a 32*32->64b full multiply in 32bit code. gcc does a bad job with mulx as well.

Reminder: mulx has two read-only input operands: implicit rdx and the explicit source operand (first operand in AT&T syntax). The next two operands are write-only output operands: high half and low half, in that order in AT&T syntax.

Summary of things I noticed:
* sometimes mul r32 is the right choice, even when imul r64,r64 is available. It's faster on some CPUs, but even -mtune=atom (the most extreme case of slow 64bit imul) uses imul r64,r64.
* mulx r64,r64,r64 has one cycle more latency than mul r64 on Haswell.
* gcc doesn't notice that it should put C in edx for mulx when it needs B*C and A*C.
* gcc is terrible at choosing output registers for mulx.
* It can matter which order multiplies appear in the source. If one of the multiplies has only half its result used, it can save instructions to do it first, since only one of rdx and rax needs to be saved. Compilers should re-order for optimal code regardless of source order.
* Is it normal for -m32 -Os to default to making stack frames? clang -m32 -Os doesn't.

Test cases (godbolt link: http://goo.gl/2iL7f2):

// 32bit version of Steven's testcase
#include <stdint.h>
uint64_t foo64(unsigned x, unsigned y) {
    uint64_t z = (uint64_t)x * y;
    return z ^ (z >> 32);
}

gcc 5.3.0 -O3 -m32
        pushl   %ebx
        movl    12(%esp), %eax
        mull    8(%esp)
        popl    %ebx       # save/restore of an otherwise-unused register. Regression from 4.9.2
        xorl    %edx, %eax

gcc 5.3.0 -O3 -m32 -mbmi2
        pushl   %ebx
        movl    12(%esp), %eax
        movl    %eax, %edx     # oops, we're using mulx, not mul?
        mulx    8(%esp), %ecx, %ebx
        movl    %ecx, %eax     # this is sub-optimal even with that choice of output reg for mulx
        movl    %ebx, %edx
        xorl    %ebx, %eax     # note that even with ideal sequences, mulx didn't gain anything
        popl    %ebx

64b: gcc 5.3.0 -O3 -mbmi2    # 64bit mode: 32bit mulx would help significantly, but it isn't used
        movl    %edi, %eax
        movl    %esi, %esi
        imulq   %rax, %rsi
        movq    %rsi, %rax
        shrq    $32, %rax
        xorq    %rsi, %rax

hand-optimized 64bit: same as the 32bit case.
        mov     %edi, %eax
        mul     %esi           # mulx isn't helpful here
        xor     %edx, %eax

Even when inlined somewhere that doesn't need the upper halves of input regs zeroed, it's 3 uops on SnB for mul r32 vs. 3 for imul r,r + mov + shr, with better or equal latency (depending on mov being 0 or 1c). I guess this would require recognizing when we want 2 halves of a multiply, and using the otherwise-slower single-operand form of mul. Note that AMD BD-family runs `mul r32` faster than `imul r64, r64`, and so does Atom (but not Silvermont).

---

//Steven's function:
uint64_t foo128(uint64_t x, uint64_t y) {
    __uint128_t z = (__uint128_t)x * y;
    return z ^ (z >> 64);
}

gcc 5.3.0 -O3: same as Steven reported for gcc 4.7

gcc 5.3.0 -O3 -march=haswell
        movq    %rdi, %rdx         # correct startup sequence
        mulx    %rsi, %r9, %r10    # bad choice of output regs, like 32bit
        movq    %r10, %rax         # correct sequence for handling the badly-chosen mulx outputs
        xorq    %r9, %rax

At 64bit operand size, mul has one cycle lower latency than mulx on Haswell, so it's only a better choice when the choice of outputs helps, or the different implicit input (rdx instead of rax).
Obviously we can avoid the mov and the REX prefixes by choosing different output registers. clang uses rcx and rax as output registers for mulx, which is the obvious choice (or overwrite an input register).

// A slightly more complex function:
struct DE64 { uint64_t D,E; };

struct DE64 f64_structret(uint64_t A, uint64_t B, uint64_t C) {
    __uint128_t AC = A * (__uint128_t)C;  // gcc makes slightly better code with BC first. Order shouldn't matter
    __uint128_t BC = B * (__uint128_t)C;
    uint64_t D = AC >> 64;                // high half
    uint64_t E = AC + (BC >> 64);
    struct DE64 retval = { D, E };
    return retval;
}

# C is already in rdx, which is perfect for mulx. In the 32bit case (below), gcc doesn't realize it should put C into edx for easy reuse. g
[Bug c++/67461] Multiple atomic stores generate a StoreLoad barrier between each one, not just at the end
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461 --- Comment #2 from Peter Cordes --- (In reply to Andrew Pinski from comment #1) > Hmm, I think there needs to be a barrier between each store as each store > needs to be observed by the other threads. On x86, stores are already ordered wrt. other stores. A full-barrier (including a StoreLoad barrier) after the last store will prevent it from passing (appearing after) any subsequent loads. StoreStore, LoadLoad, and LoadStore barriers are implicit between every memory operation. (except non-temporal ones). http://preshing.com/20120710/memory-barriers-are-like-source-control-operations/ I *think* that's enough for sequential consistency. If *I'm* misunderstanding this (which is possible), then please clue me in. There's definitely a problem on ARM, though. There's no way two consecutive dmb sy instructions are useful.
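The scenario under discussion, in C11 form (my sketch; the bug itself is about the C++ std::atomic equivalent): two back-to-back seq_cst stores. The comment's argument is that on x86, plain stores are already ordered with respect to each other, so one trailing full barrier (mfence, or an xchg-based store) after the last store should be enough.

#include <stdatomic.h>

_Atomic int a, b;

void two_seqcst_stores(void)
{
    atomic_store(&a, 1);   /* seq_cst by default */
    atomic_store(&b, 2);   /* per the comment, only this one needs the StoreLoad barrier on x86 */
}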
[Bug tree-optimization/69908] New: recognizing idioms that check for a buffer of all-zeros could make *much* better code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69908 Bug ID: 69908 Summary: recognizing idioms that check for a buffer of all-zeros could make *much* better code Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

Checking a block of memory to see if it's all-zero, or to find the first non-zero, seems to be a not-uncommon problem. It's really hard to get gcc to emit code that's even half-way efficient. The most recent stackoverflow question about this (with links to previous ones) is http://stackoverflow.com/questions/35450237/fastest-way-to-check-mass-data-if-null-in-c

Summary:
* gcc would benefit a lot from recognizing zero-checking idioms (with a suggested asm loop for x86).
* one zero-checking function compiles to bad x86 code in multiple ways
* even a simple byte-at-a-time loop on a fixed-size buffer compiles to byte-at-a-time asm.
* gcc is bad at horizontal reductions, esp. with AVX2

I'm using x86 asm for examples of whether gcc auto-vectorizes or not, but this is architecture-independent.

-

Ideally we'd like the main loop in these functions to test 64B at a time (a whole cache line on all modern x86 microarchitectures), something like:

    ... some intro stuff ...
    pxor    %xmm5, %xmm5
.Lvec_loop:
    movdqa  (%rsi), %xmm0
    por     16(%rsi), %xmm0
    por     32(%rsi), %xmm0
    por     48(%rsi), %xmm0

    #ptest  %xmm0, %xmm0          # SSE4.1
    #jnz    .Lnonzero_found
    pcmpeqb %xmm5, %xmm0
    pmovmskb %xmm0, %eax
    cmp     $0xffff, %eax         # check that all the bytes compared equal to zero
    jne     .Lnonzero_found

    add     $64, %rsi
    cmp     pointer, end
    jb      .Lvec_loop
    # Intel: 9 fused-domain uops in the loop:
    # one too many to issue in 2 cycles and saturate 2 loads per cycle.

    # epilogue for the final partial cache-line
    # We can test some bytes again,
    # e.g. using 16B unaligned loads that end at the correct place
    movdqu  -16(end), %xmm0
    movdqu  -32(end), %xmm1
    por     %xmm1, %xmm0
    movdqu  -48(end), %xmm1
    por     %xmm1, %xmm0
    movdqu  -64(end), %xmm3
    por     %xmm3, %xmm0
    # ptest or pcmpeq / pmovmskb / cmp

We'd have the intro code handle inputs smaller than 64B, so the epilogue couldn't access memory from before the start of the buffer. pcmpeq / pmovmsk / cmp is better than pshufd / por / movq / test, esp. for 32bit where another round of horizontal reduction would be needed.

It might be better to use two accumulators to make better use of two load ports from a single cache line, but hopefully the loads will dispatch mostly in program order, so there hopefully won't be many cache-bank conflicts on SnB/IvB when multiple iterations are in flight at once. The POR dep chain is not loop-carried, and out-of-order execution should hide it nicely.

I have no idea how to write C (without intrinsics) that would auto-vectorize to anything like that, or even to something acceptable. It would be nice if there was some kind of idiom that compilers could recognize and make good code for, without needing custom code for every platform where we want non-terrible output.

--

The most solid attempt on that SO question ORs together eight size_t elements in the main loop, then uses a byte cleanup loop. gcc makes a mess of it. Summary of problems with gcc 5.3.0 -O3 -march=haswell for this function (I can report separate bugs for the separate problems; other than recognizing zero-checking idioms, most of these problems could probably be fixed separately):
* gcc doesn't realize that we're ultimately testing for all-zero, and just treats OR as any other associative operation.
* even a simple byte-loop over a fixed-size buffer doesn't get optimized at all (different function, see below)
* main loop not auto-vectorized
* word-at-a-time and byte-at-a-time cleanup loops generate full loops: gcc doesn't realize they're just cleanup that will only do less than one vector of data.
* word-at-a-time cleanup loop gets a bloated fully-unrolled scalar intro (which is all that will ever run)
* byte cleanup loop auto-vectorization unpacks vectors of bytes to longs before ORing, with a big chain of vextracti128 / vmovzx.
* Without AVX2, gcc does a full unroll of the unaligned epilogue for the byte-cleanup autovectorization.

The bad auto-vectorized cleanup-loop code will never run, only their scalar intros, because of the logic of the function. Presumably gcc would generate the nasty pmovzx byte-unpacking code in situations where it would actually run. The byte cleanup loop has a byte-at-a-time scalar intro loop (not unrolled), wh
[Bug rtl-optimization/69933] New: non-ideal branch layout for an early-out return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69933 Bug ID: 69933 Summary: non-ideal branch layout for an early-out return Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

(just guessing about this being an RTL bug, please reassign if it's target-specific or something else).

This simple linked-list traversal compiles to slightly bulkier code than it needs to:

int traverse(struct foo_head *ph)
{
    int a = -1;
    struct foo *p, *pprev;
    pprev = p = ph->h;
    while (p != NULL) {
        pprev = p;
        p = p->n;
    }
    if (pprev)
        a = pprev->a;
    return a;
}

(gcc 5.3.0 -O3 on godbolt: http://goo.gl/r8vb5L)

        movq    (%rdi), %rdx
        movl    $-1, %eax      ; only needs to happen in the early-out case
        testq   %rdx, %rdx
        jne     .L3            ; jne/ret or je / fall through would be better
        jmp     .L9
.L5:
        movq    %rax, %rdx
.L3:
        movq    (%rdx), %rax
        testq   %rax, %rax
        jne     .L5
        movl    8(%rdx), %eax
        ret
.L9:                           ; ARM / PPC gcc 4.8.2 put the a=-1 down here
        ret                    ; this is a rep ret without -mtune=intel

Clang 3.7 chooses a better layout with a je .early_out instead of the jne / jmp. It arranges the loop so it can enter at the top. It actually looks pretty optimal:

        movq    (%rdi), %rcx
        movl    $-1, %eax
        testq   %rcx, %rcx
        je      .LBB0_3
.LBB0_1:                                # %.lr.ph
        movq    %rcx, %rax
        movq    (%rax), %rcx
        testq   %rcx, %rcx
        jne     .LBB0_1
        movl    8(%rax), %eax
.LBB0_3:                                # %._crit_edge.thread
        retq

Getting the mov $-1 out of the common case would require a separate mov/ret block after the normal ret, so it's a code-size tradeoff which isn't worth it, because a mov-immediate is dirt cheap. Anyway, there are a couple of different ways to lay out the branches and the mov $-1, %eax, but gcc's choice is in no way optimal. :(
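The report doesn't show the struct definitions; a layout consistent with the asm above (p->n loaded from offset 0, p->a from offset 8 on x86-64) would be this reconstruction of mine, which also makes the next bug's test case compile:

struct foo { struct foo *n; int a; };   /* ->n at offset 0, ->a at offset 8 */
struct foo_head { struct foo *h; };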
[Bug tree-optimization/69935] New: load not hoisted out of linked-list traversal loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69935 Bug ID: 69935 Summary: load not hoisted out of linked-list traversal loop Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

(please check the component. I guessed tree-optimization since it's cross-architecture.)

gcc doesn't hoist the p->a load out of the loop in this linked-list function:

int traverse_loadloop(struct foo_head *ph)
{
    int a = -1;
    struct foo *p = ph->h;
    while (p) {
        a = p->a;
        p = p->n;
    }
    return a;
}

I checked on godbolt with gcc 4.8 on ARM/PPC/ARM64, and gcc 4.5.3 for AVR. For x86, gcc 5.3.0 -O3 on godbolt (http://goo.gl/r8vb5L) does this:

        movq    (%rdi), %rdx
        movl    $-1, %eax
        testq   %rdx, %rdx
        je      .L10
.L11:
        movl    8(%rdx), %eax   ; load p->a inside the loop, not hoisted
        movq    (%rdx), %rdx
        testq   %rdx, %rdx
        jne     .L11
.L10:
        rep ret

This is nice and compact, but less hyperthreading-friendly than it could be. (The mov reg,reg alternative doesn't even take an execution unit on recent CPUs.) The load of p->a every time through the loop might also delay the p->n load by a cycle on CPUs with only one load port, or when there's a cache-bank conflict. This might take the loop from one iteration per 4c to one per 5c (if L1 load-use latency is 4c).

Clang hoists the load out of the loop, producing identical asm output for this function and one with the load hoisted in the C source. (The godbolt link has both versions. Also see bug 69933, which I just reported, since gcc showed a separate branch-layout issue for the source-level hoisting version.)
[Bug rtl-optimization/69943] New: expressions with multiple associative operators don't always create instruction-level parallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943 Bug ID: 69943 Summary: expressions with multiple associative operators don't always create instruction-level parallelism Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

Separate problems (which maybe should be separate bugs, let me know):
* associativity not exploited for ILP in integer operations
* using a mov from memory instead of an add
* FP ILP from associativity generates two extra mov instructions

gcc 5.3.0 -O3 (http://goo.gl/IRdw05) has two problems compiling this:

int sumi(int a, int b,int c,int d,int e,int f,int g,int h) {
    return a+b+c+d+e+f+g+h;
}

        addl    %edi, %esi
        movl    8(%rsp), %eax   # when an arg comes from memory, it forgets to use lea as a 3-arg add
        addl    %esi, %edx
        addl    %edx, %ecx
        addl    %ecx, %r8d
        addl    %r8d, %r9d
        addl    %r9d, %eax
        addl    16(%rsp), %eax

The expression is evaluated mostly in order from left to right, not as ((a+b) + (c+d)) + ((e+f) + (g+h)). This gives a latency of 8 clocks. If the inputs became ready at one per clock, this would be ideal (only one add depends on the last input), but we shouldn't assume that when we can't see the code that generated them. The same lack of parallelism happens on ARM, ARM64, and PPC.

-

The FP version of the same *does* take advantage of associativity for parallelism with -ffast-math, but uses two redundant mov instructions:

float sumf(float a, float b,float c,float d,float e,float f,float g,float h) {
    return a+b+c+d+e+f+g+h;
}

        addss   %xmm4, %xmm5    # e, D.1876
        addss   %xmm6, %xmm7    # g, D.1876
        addss   %xmm2, %xmm3    # c, D.1876
        addss   %xmm0, %xmm1    # a, D.1876
        addss   %xmm7, %xmm5    # D.1876, D.1876
        movaps  %xmm5, %xmm2    # D.1876, D.1876
        addss   %xmm3, %xmm2    # D.1876, D.1876
        movaps  %xmm2, %xmm0    # D.1876, D.1876
        addss   %xmm1, %xmm0    # D.1876, D.1876

clang avoids any unnecessary instructions, but has less FP ILP, and the same lack of integer ILP. Interestingly, clang lightly auto-vectorizes sumf when the expression is parenthesised for ILP, but only *without* -ffast-math. http://goo.gl/Pqjtu1

As usual, IDK whether to mark this as RTL, tree-ssa, or middle-end. The integer ILP problem is not target specific.
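For reference, the balanced-tree version referred to above ("when the expression is parenthesised for ILP"), in my own spelling: the dependency chain is 3 adds deep instead of 7, at the cost of nothing but source-level parentheses.

int sumi_ilp(int a, int b, int c, int d, int e, int f, int g, int h)
{
    return ((a + b) + (c + d)) + ((e + f) + (g + h));
}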
[Bug tree-optimization/69943] expressions with multiple associative operators don't always create instruction-level parallelism
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943

--- Comment #3 from Peter Cordes ---
(In reply to ktkachov from comment #2)
> On second thought, reassociating signed addition is not legal in general
> because we might introduce signed overflow where one didn't exist before.

In an intermediate result, yes. The final result can't change on 2's complement hardware, though, so it's a legal optimization. Good thinking that the compiler might treat signed and unsigned integers differently, though.

You only need to avoid it on hardware where signed overflow has side-effects (e.g. setting a "sticky" flag that can be checked after a sequence of operations to see if there were any overflows along the way). And I think MIPS has signed vs. unsigned add instructions, and can raise an exception? Anyway, x86 doesn't have any of those things, and the calling convention lets flags be in any arbitrary state when the function returns. So this optimization is valid for signed integers on x86.

BTW, using unsigned results in two LEA instructions, even though there's still a MOV from memory. ADD is shorter to encode, and can run on more execution ports. It also has higher latency on Atom.
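A sketch of the source-level workaround hinted at by the last paragraph: doing the additions in unsigned arithmetic, where wraparound is well-defined, so the compiler is free to reassociate. (The final conversion back to int is implementation-defined, but gcc documents it as modulo wrapping, so the bits match the signed sum on 2's complement targets.)

int sumi_unsigned(int a, int b, int c, int d, int e, int f, int g, int h) {
    unsigned s = ((unsigned)a + b) + ((unsigned)c + d);
    unsigned t = ((unsigned)e + f) + ((unsigned)g + h);
    return (int)(s + t);    // same bits as the left-to-right signed sum
}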
[Bug target/69986] New: smaller code possible with -Os by using push/pop to spill/reload
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69986 Bug ID: 69986 Summary: smaller code possible with -Os by using push/pop to spill/reload Product: gcc Version: 5.3.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: minor Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86-64-*-*

#include <unistd.h>
int f(int a) { close(a); return a; }

    push    rbx
    mov     ebx, edi
    call    400490
    mov     eax, ebx
    pop     rbx
    ret

with gcc 5.3 -Os. It could be smaller:

    push    rdi
    call    400490
    pop     rax
    ret

saving 4 bytes (mov reg,reg is two bytes). More generally, push/pop are 1 byte each, much smaller than mov [rsp-8], edi or something.

This might not be a desirable optimization, though, because a round-trip through memory increases latency. It's one of those code-size optimizations that might often have a negative impact on performance in the case where the function is already hot in L1 I-cache.

It would be nice if there was a way to optimize a bit for code-size without making bad performance sacrifices, and also another option to optimize for code size without much regard for performance. -Oss vs. -Os? Or -OS? I assume tuning these options is a lot of work.
[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408 Bug ID: 70408 Summary: reusing the same call-preserved register would give smaller code in some cases Product: gcc Version: 6.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: enhancement Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

int foo(int);  // not inlineable
int bar(int a) { return foo(a+2) + 5 * foo (a); }

gcc (and clang and icc) all make bigger code than necessary for x86. gcc uses two call-preserved registers to save `a` and `foo(a+2)`. Besides the extra push/pop, stack alignment requires a sub/add esp,8 pair. Combining data-movement with arithmetic wherever possible is also a win (using lea), but gcc also misses out on that.

# gcc6 snapshot 20160221 on godbolt (with -O3): http://goo.gl/dN5OXD
    pushq   %rbp
    pushq   %rbx
    movl    %edi, %ebx
    leal    2(%rdi), %edi       # why lea instead of add rdi,2?
    subq    $8, %rsp
    call    foo                 # foo(a+2)
    movl    %ebx, %edi
    movl    %eax, %ebp
    call    foo                 # foo(a)
    addq    $8, %rsp
    leal    (%rax,%rax,4), %eax
    popq    %rbx
    addl    %ebp, %eax
    popq    %rbp
    ret

clang 3.8 makes essentially the same code (but wastes an extra mov because it doesn't produce the result in %eax).

By hand, the best I can come up with is:

    push    %rbx
    lea     2(%rdi), %ebx       # stash ebx = a+2
    call    foo                 # foo(a)
    mov     %ebx, %edi
    lea     (%rax,%rax,4), %ebx # reuse ebx to stash 5*foo(a)
    call    foo                 # foo(a+2)
    add     %ebx, %eax
    pop     %rbx
    ret

Note that I do the calls to foo() in the other order, which allows more folding of MOV into LEA. The savings from that are somewhat orthogonal to the savings from reusing the same call-preserved register. Should I open a separate bug report for the failure to optimize by reordering the calls?

I haven't tried to look closely at ARM or PPC code to see if they succeed at combining data movement with math (prob. worth testing with `foo(a) * 4` since x86's shift+add LEA is not widely available). I didn't mark this as an i386/x86-64 bug, because the reuse of call-preserved registers affects all architectures. IDK if teaching gcc about either of these tricks would help with real code in many cases, or how hard it would be.
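A sketch of the source-level evaluation order my hand-written asm corresponds to (only equivalent if the order of the two calls doesn't matter; see the discussion of __attribute__((pure)) in the next comment):

int foo(int);   // not inlineable

int bar_reordered(int a) {
    int t = 5 * foo(a);     // do foo(a) first so its result can be scaled before the second call
    return foo(a + 2) + t;
}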
[Bug c/70408] reusing the same call-preserved register would give smaller code in some cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408

--- Comment #2 from Peter Cordes ---
Should I open a separate bug for the reusing call-preserved regs thing, and retitle this one to the call-reordering issue we ended up talking about here? I always have a hard time limiting an optimization bug report to a single issue, sorry.

(In reply to Andrew Pinski from comment #1)
> Note teaching this trick requires a huge amount of work as you need to teach
> GCC more about order of operands does not matter; this requires work in the
> front-end and then in the gimple level and then maybe the middle-end.

Ok :(

> Is it worth it for the gain, most likely not, you are more likely just to
> get better code by not depending on unspecified behavior in C.

Writing the code this way intentionally leaves it up to the compiler to choose the optimal order to evaluate foo(a+2) and foo(a). I don't see why forcing the compiler into one choice or the other should be considered "better" for performance, just because gcc doesn't take advantage of its options. (Better for maintainability in case someone adds side-effects to foo(), sure.)

I should have used __attribute__((pure)) int foo(int); to make it clear that the order of the function calls didn't matter. That would make reordering legal even if the calls were separated by a sequence point, wouldn't it? (Of course, it sounds like gcc still wouldn't consider doing the reordering.)

> > # why lea instead of add rdi,2?
>
> Because lea does not clobber the flags, so this might be faster, it depends
> on the machine.

Every OOO x86 CPU renames EFLAGS, because almost every instruction writes flags. There aren't any CPUs where instructions that don't write flags are faster for that reason. (Not writing flags is useful when it lets you reuse some already-set flags for another check with a different condition, or stuff like that, but that's not the case here.)

On Intel Haswell for example, the LEA can run on port 1 or 5, but the add can run on port 0,1,5,6. Otherwise they're the same (latency, total uops, and code-size). Using `-mtune=haswell` doesn't get it to choose add edi,2 :( (From http://agner.org/optimize/ instruction tables, and Agner's microarch pdf)

LEA is special on Atom. I don't remember exactly what its effect is on latency in Atom's in-order pipeline, but LEA happens at a different pipeline stage from normal ALU instructions (actually running on the AGUs). IIRC, that's an earlier stage, so inputs need to be ready sooner.

> Also try -Os you might see a difference code.

No change with -Os.
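For the record, a minimal sketch of the pure-annotated declaration mentioned above, which tells gcc the result depends only on the argument (and global memory) and that the call has no side effects:

__attribute__((pure)) int foo(int);   // no side effects; result depends only on its inputs

int bar(int a) {
    // with pure, gcc would be allowed to reorder (or even CSE) the two calls
    return foo(a + 2) + 5 * foo(a);
}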
[Bug c++/71245] New: std::atomic load/store bounces the data to the stack using fild/fistp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245 Bug ID: 71245 Summary: std::atomic load/store bounces the data to the stack using fild/fistp Product: gcc Version: 6.1.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu

Same result with gcc 4.8, gcc5, and gcc6.1. Didn't test exhaustively.

#include <atomic>
std::atomic<double> d(5.0);

void foo_d(void) {
    d = d + 1.0;
    // d += 1.0;  // unimplemented
}

with gcc6.1 -m32 -O3 -march=i586 (https://godbolt.org/g/w3VKpG) this compiles to

foo_d():
    subl    $20, %esp    #,
    fildq   d            #
    fistpq  (%esp)       # %sfp   # copy `d`'s bits to the stack (with no loss)
    fldl    (%esp)       # %sfp
    fadds   .LC0         #
    fstpl   (%esp)       # %sfp   # store d + 1.0 to the stack
    fildq   (%esp)       # %sfp   #
    fistpq  d            #        # atomic store using fild
    lock orl $0, (%esp)  #        # mfence equivalent
    addl    $20, %esp    #,
    ret

I assume fild/fistp is gcc's trick for implementing atomic loads and stores without resorting to cmpxchg8b. Clever, since 80bit float can't munge the data. The fild/fistp pairs are of course not necessary in this case, where the data is already 64bit float. The function should just fld / fstp to d directly.

With -march=i486 or lower, gcc correctly doesn't assume that 64bit FP loads/stores are atomic, so it calls a library function to do the atomic load and store. With SSE or SSE2 available, it uses an SSE load/store to copy to the stack. With -msse2 and -mfpmath=sse, we finally load/store directly from/to d, with movq / addsd / movq. movq vs. movsd shouldn't make a performance difference, I think.

We don't need to allocate any stack space. We could implement the StoreLoad barrier with lock or $0, -4(%esp) instead of reserving extra stack just to avoid doing it to our return address (which would introduce extra store-forwarding delay before the ret could eventually retire).
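For reference, a sketch of what the assignment expands to in terms of atomic operations (equivalent standalone source with the load and store written out explicitly; this is just a restatement of the test case, not a proposed fix):

#include <atomic>
std::atomic<double> d(5.0);

void foo_d_explicit(void) {
    double tmp = d.load(std::memory_order_seq_cst);    // atomic 8-byte load
    d.store(tmp + 1.0, std::memory_order_seq_cst);     // atomic 8-byte store; the seq_cst store is what needs the StoreLoad barrier
}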
[Bug target/71321] New: [6 regression] x86: worse code for uint8_t % 10 and / 10
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71321 Bug ID: 71321 Summary: [6 regression] x86: worse code for uint8_t % 10 and / 10 Product: gcc Version: 6.1.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu, x86_64-linux-gnu

If we have an integer (0..99), we can modulo and divide by 10 to get two decimal digits, then convert to a pair of ASCII bytes with a newline by adding `00\n`. When replacing div and mod with a multiplicative inverse, gcc 6.1 uses more instructions than gcc 5.3, due to poor choices. See also https://godbolt.org/g/vvS5J6

#include <stdint.h>
// assuming little-endian
__attribute__((always_inline)) unsigned cvt_to_2digit(uint8_t i, uint8_t base) {
    return ((i / base) | (uint32_t)(i % base)<<8);
}
// movzbl  %dil,%eax    # 5.3 and 6.1, with -O3 -march=haswell
// div     %sil
// movzwl  %ax,%eax
// at -Os, gcc uses a useless AND eax, 0xFFF, instead of a movzx eax,ax.  I think to avoid partial-register stalls?

unsigned cvt_to_2digit_ascii(uint8_t i) {
    return cvt_to_2digit(i, 10) + 0x0a3030;   // + "00\n" converts to ASCII
}

Compiling with -O3 -march=haswell

    ## gcc 5.3                      ## gcc 6.1
    movzbl  %dil, %edx              movzbl  %dil, %eax
    leal    (%rdx,%rdx,4), %ecx     leal    0(,%rax,4), %edx    # requires a 4B zero displacement
    leal    (%rdx,%rcx,8), %edx     movl    %eax, %ecx          # lea should let us avoid mov
    leal    (%rdx,%rdx,4), %edx     addl    %eax, %edx
                                    leal    (%rcx,%rdx,8), %edx
                                    leal    0(,%rdx,4), %eax    # requires a 4B zero displacement
                                    addl    %eax, %edx
    shrw    $11, %dx                shrw    $11, %dx
    leal    (%rdx,%rdx,4), %eax     leal    0(,%rdx,4), %eax    # requires a 4B zero displacement. gcc5.3 didn't use any of these
                                    addl    %edx, %eax
    movzbl  %dl, %edx               movzbl  %dl, %edx           # same after this
    addl    %eax, %eax              addl    %eax, %eax
    subl    %eax, %edi              subl    %eax, %edi
    movzbl  %dil, %eax              movzbl  %dil, %eax
    sall    $8, %eax                sall    $8, %eax
    orl     %eax, %edx              orl     %eax, %edx
    leal    667696(%rdx), %eax      leal    667696(%rdx), %eax

with -mtune=haswell, it's prob. best to merge with mov ah, dil or something, rather than movzx/shift/or. Haswell has no penalty for partial-registers, but still has partial-reg renaming to avoid false dependencies: the best of both worlds.

BTW, with -Os, both gcc versions compile it to

    movb    $10, %dl
    movzbl  %dil, %eax
    divb    %dl
    andl    $4095, %eax      # partial reg stall. gcc does this even with -march=core2 where it matters
    addl    $667696, %eax

The AND appears to be totally useless, because the upper bytes of eax are already zero (from movzbl %dil, %eax before div). I thought the movzwl %ax, %eax in the unknown-divisor version was to avoid partial-register slowdowns, but maybe it's just based on the possible range of the result.

Off-topic, but I noticed this while writing FizzBuzz in asm. http://stackoverflow.com/a/37494090/224132
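For context, a sketch of how the packed result is meant to be used (little-endian assumed, as noted in the source; the function and buffer names here are made up for illustration):

#include <stdint.h>
#include <string.h>

// write "tens digit, ones digit, newline" for a value 0..99, e.g. 42 -> '4','2','\n'
void store_2digit_line(uint8_t i, char *buf) {
    unsigned packed = ((i / 10) | (uint32_t)(i % 10) << 8) + 0x0a3030;  // same value as cvt_to_2digit_ascii(i)
    memcpy(buf, &packed, 3);    // low byte first on little-endian
}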
[Bug target/71245] std::atomic load/store bounces the data to the stack using fild/fistp
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245

--- Comment #3 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #2)
> Recently x86 linux changed the barrier to what you propose. If it is worth,
> we can change it without any problems.

I guess it costs a code byte for a disp8 in the addressing mode, but it avoids adding a lot of latency to a critical path involving a spill/reload to (%esp), in functions where there is something at (%esp). If it's an object larger than 4B, the lock orl could even cause a store-forwarding stall when the object is reloaded (e.g. a double or a vector).

Ideally we could do the lock orl on some padding between two locals, or on something in memory that wasn't going to be loaded soon, to avoid touching more stack memory (which might be in the next page down). But we still want to do it on a cache line that's hot, so going way up above our own stack frame isn't good either.

> OTOH, we have "orl" here - should we
> change it to "addl" to be consistent with kernel?

That's the common idiom I've seen, but there's no reason I know of to favour ADD instead of OR. They both write all the flags, and both can run on any ALU port on every microarchitecture. Since gcc has been using OR already, with (I assume) nobody reporting perf problems, we should keep it.

A 32bit operand size is still a good choice. (The obvious alternative being 8bit, but that doesn't save any code size. From Agner Fog's insn tables, I don't see any different entry for locked instructions with m8 vs. m32 operands, but naturally-aligned 32bit loads/stores are probably the safest bet.)
[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #6 from Peter Cordes --- Created attachment 38629 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=38629&action=edit extra-movdqa-with-gcc5-not-4.9.cpp
[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511

--- Comment #7 from Peter Cordes ---
I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1. IDK if the cause is the same. (code from an improvement to the horizontal_add functions in Agner Fog's vector class library)

#include <immintrin.h>
int hsum16_gccmovdqa (__m128i const a) {
    __m128i lo   = _mm_cvtepi16_epi32(a);     // sign-extended a0, a1, a2, a3
    __m128i hi   = _mm_unpackhi_epi64(a,a);   // gcc4.9 through 5.3 wastes a movdqa on this
    hi           = _mm_cvtepi16_epi32(hi);
    __m128i sum1 = _mm_add_epi32(lo,hi);      // add sign-extended upper / lower halves
    //return horizontal_add(sum1);            // manually inlined.
    // Shortening the code below can avoid the movdqa
    __m128i shuf = _mm_shuffle_epi32(sum1, 0xEE);
    __m128i sum2 = _mm_add_epi32(shuf,sum1);  // 2 sums
    shuf         = _mm_shufflelo_epi16(sum2, 0xEE);
    __m128i sum4 = _mm_add_epi32(shuf,sum2);
    return _mm_cvtsi128_si32(sum4);           // 32 bit sum
}

gcc4.9 through gcc5.3 output (-O3 -mtune=generic -msse4.1):

    movdqa      %xmm0, %xmm1
    pmovsxwd    %xmm0, %xmm2
    punpckhqdq  %xmm0, %xmm1
    pmovsxwd    %xmm1, %xmm0
    paddd       %xmm2, %xmm0
    ...

gcc6.1 output:

    pmovsxwd    %xmm0, %xmm1
    punpckhqdq  %xmm0, %xmm0
    pmovsxwd    %xmm0, %xmm0
    paddd       %xmm0, %xmm1
    ...

In a more complicated case, when inlining this code or not, there's actually a difference between gcc 4.9 and 5.x: gcc5 has the extra movdqa in more cases. See my attachment, copied from https://godbolt.org/g/e8iQsj
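For reference, a scalar sketch of what hsum16_gccmovdqa computes (a plain horizontal sum of the eight int16_t elements, accumulated in 32 bits); this is only an illustration, not code from the vector class library:

#include <stdint.h>

int hsum16_scalar(const int16_t v[8]) {
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += v[i];    // each element sign-extended to 32-bit before adding
    return sum;
}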
[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837

--- Comment #3 from Peter Cordes ---
Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN

gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact. -m64 just calls __atomic_load_16.

gcc7.2 -O3 -mx32 output:

follow_nounion(std::atomic*):
    movq    (%edi), %rax
    movl    %eax, %eax
    ret

vs. gcc7.1 -O3 -mx32:

follow_nounion(std::atomic*):
    movq    (%edi), %rcx
    xorl    %edx, %edx
    movzbl  %ch, %eax
    movb    %cl, %dl
    movq    %rcx, %rsi
    movb    %al, %dh
    andl    $16711680, %esi
    andl    $4278190080, %ecx
    movzwl  %dx, %eax
    orq     %rsi, %rax
    orq     %rcx, %rax
    ret

---

gcc7.2 -O3 -m64 just forwards its arg to __atomic_load_16 and then returns:

follow_nounion(std::atomic*):
    subq    $8, %rsp
    movl    $2, %esi
    call    __atomic_load_16
    addq    $8, %rsp
    ret

It unfortunately doesn't optimize the tail-call to

    movl    $2, %esi
    jmp     __atomic_load_16

presumably because it hasn't realized early enough that it takes zero instructions to extract the 8-byte low half of the 16-byte __atomic_load_16 return value.
[Bug inline-asm/82001] New: [5/6/7/8 regression] wrong code when two functions differ only in inline asm register constraints
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82001 Bug ID: 82001 Summary: [5/6/7/8 regression] wrong code when two functions differ only in inline asm register constraints Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: inline-asm Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

When a single compilation unit contains two functions that are identical other than specific-register constraints in an asm statement, they are incorrectly treated as exactly identical and the same code is emitted for both. This happens at -O2 or higher, not -O1.

I was able to construct this test-case with two functions that are both plausibly useful. (Although this actually came up while discussing a beginner SO question about inline asm. https://stackoverflow.com/questions/45910530/how-to-write-a-short-block-of-inline-gnu-extended-assembly-to-swap-the-values-of#comment78780443_45910796)

int mullo(int a, int b) { asm("mul %%edx  # %%1 was %1" : "+a" (a), "+d" (b)); return a; }
int mulhi(int a, int b) { asm("mul %%edx  # %%1 was %1" : "+d" (a), "+a" (b)); return a; }

gcc8.0.0-snapshot 20170827 -O3 (https://godbolt.org/g/CYjnGg) compiles them both to

    movl    %edi, %eax    # a, a
    movl    %esi, %edx    # b, b
    mul     %edx          # %1 was %edx    # b
    ret

Any difference in the asm string, or in the clobber registers, makes them not match. Also, an "m" constraint is seen as different from an "r" or specific-register constraint, but "imr" can "match" an "r".

In gcc6/7/8, both functions use the correct asm for the 1st function. Swapping the order changes the asm to the other one. In gcc5.x, both functions use the correct asm for the 2nd function. In gcc4.9.4, both functions are compiled correctly.
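Based on the observation above that any difference in the asm string prevents the (incorrect) merging, one possible stop-gap in user code is to make the templates textually distinct, e.g. with a comment naming the function. This is a sketch of a workaround, not a fix for the underlying bug:

int mullo_wa(int a, int b) { asm("mul %%edx  # mullo" : "+a" (a), "+d" (b)); return a; }
int mulhi_wa(int a, int b) { asm("mul %%edx  # mulhi" : "+d" (a), "+a" (b)); return a; }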
[Bug target/53687] _mm_cmpistri generates redundant movslq %ecx,%rcx on x86-64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca

--- Comment #1 from Peter Cordes ---
This behaviour is pretty understandable. gcc doesn't know that the return-value range is only 0-16, i.e. guaranteed non-negative integers. Since you used a signed int offset, it makes sense that it *sign* extends from 32 to 64.

If you use an unsigned offset, the missed-optimization becomes more obvious. gcc7.2 still uses a movl %ecx, %ecx to zero-extend into rcx. https://godbolt.org/g/wWvqpa (Incidentally, same,same is the worst possible choice of registers for Intel CPUs. It means the mov can never be eliminated in the rename stage, and always needs an execution port with non-zero latency.)

Even a uintptr_t offset doesn't avoid it, because then the conversion from the intrinsic to the variable results in sign-extension up to 64-bit. It treats it exactly like a function that returns int, which in the SysV ABI is allowed to have garbage in the upper 32.

(BTW, this use of flags from inline asm is not guaranteed to be safe. Nothing stops the optimizer from doing the pointer-increment after the `pcmpistri`, which would clobber flags. You could do `pcmpistri` inside the asm and produce a uintptr_t output operand, except that doesn't work with goto. So really you should write the whole loop in inline asm.)

Or better, don't use inline asm at all: gcc can CSE _mm_cmpistri with _mm_cmpistra, so you can just use the intrinsic twice to get multiple operands, and it will compile to a single instruction. This is like using `/` and `%` operators to get both results of a `div`.
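A sketch of the intrinsics-only approach suggested in the last paragraph: use _mm_cmpistri and _mm_cmpistra with identical arguments so they CSE to one pcmpistri. The mode constant and function shape here are just an illustrative assumption, not code from this bug:

#include <nmmintrin.h>   // SSE4.2

#define SEARCH_MODE (_SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED)

// one pcmpistri after CSE: index via the return value, the "above" flag via the second intrinsic
int search_block(__m128i needle, __m128i haystack, int *above) {
    *above = _mm_cmpistra(needle, haystack, SEARCH_MODE);
    return  _mm_cmpistri(needle, haystack, SEARCH_MODE);
}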
[Bug target/65146] alignment of _Atomic structure member is not correct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca

--- Comment #4 from Peter Cordes ---
Created attachment 42125 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=42125&action=edit C11 / pthreads test case for tearing of atomic_llong. compile with -m32 -pthread

This is a real bug. See attached C11 testcase for atomic_llong tearing of loads/stores in practice on real x86 hardware with gcc -m32. (It also compiles as C++11 to show the difference). One thread writes 0 and -1 alternating, the other thread reads and checks that the value is either 0 or -1. (Or with `double`, 0 and -1.1, or with a two-member struct, {-1,-1}).

Compile with gcc -m32 -march=native -O3 -pthread atomic_llong-underaligned-in-struct.c && ./a.out

offset of x in AtomicStruct = 60.  addr=0x565a80bc.  lockfree() = 1
sizeof(test_type) = 8.  test_type = long long
alignof(AtomicStruct) = 4, alignof(atomic_ttype) = 4, alignof(test_type) = 4
found tearing: tmp = 0x00

If there's a problem, the whole program uses under a millisecond of CPU time, probably most of it on startup and printing. (e.g. perf counters for machine_clears.memory_ordering = 72, so it didn't spend long in the read loop with the writer going).

I get the object to cross a cache-line boundary if the type is under-aligned by using

struct AtomicStruct { char filler[57]; atomic_ttype x; };

and using alignas(64).

In the x86 32-bit System V ABI, gcc under-aligns an atomic_llong so it can split across the boundary between two cache lines. This makes loads and stores non-atomic on every CPU (except uniprocessor of course). Some AMD CPUs are potentially non-atomic when 8B or 16B boundaries are crossed. (https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic).

Here are some ways to fix this for the i386 System V ABI. (I think the Windows 32-bit ABI aligns long long and double to 64b, but an _Atomic struct still potentially needs more than its default alignment to be efficiently lock-free).

1. Change the C stdatomic ABI to match the C++ std::atomic ABI, requiring lock-free atomic objects to be naturally aligned. (But don't change anything for non-atomic objects. The i386 SysV ABI only aligns 64-bit long long to 4B, and we do need to preserve that.)

2. Use lock cmpxchg8b to implement .load() and .store() instead of SSE or x87 load or store instructions on all 8-byte objects, like for 16-byte objects in x86-64 with cmpxchg16b.

3. Make 64-bit objects check alignment before using lock-free sequences, otherwise use locking (or lock cmpxchg8b). Checks can be optimized out in cases where the compiler can prove that an object is 8B-aligned, e.g. an object with static storage since we get to align it. Unless we're linking with code compiled by an old gcc that under-aligned static 64b objects.

4. Make 64-bit objects never lock-free in the i386 SysV ABI.

5. Option 4 + define a new 32-bit ABI that doesn't suck. (pass args in registers, use SSE FP, etc., and align 64-bit types to 64-bit.) Not realistic because nobody cares enough about 32-bit code outside of Windows, so the small benefit wouldn't justify the pain of having 2 incompatible ABIs.

Clang is already doing option 1, so gcc and clang are currently incompatible for struct layout for structs with a C11 _Atomic member. Option 1 is by far the best long-term option for performance and simplicity. (Not counting option 5).
Option 2 will work, but is always horrible for performance with pure loads or non-seq-cst stores, even with aligned objects. lock cmpxchg8b is atomic even when it crosses a cache-line or page boundary (like all x86 atomic RMW operations using the lock prefix), or whatever boundary is atomic for regular loads/stores. This is **catastrophic** for performance, though, because instead of just internally locking a line of L1D cache (by delaying responses to Invalidate its copy), the CPU has to make sure the change to both cache lines propagates all the way to memory, I think. (x86 locked instructions are atomic with respect to I/O and DMA observers, not just other CPUs, so it can't just keep both cache lines locked).

On my Skylake i7-6700k, it's literally a 132x slowdown for a single thread doing `lock add` aligned vs. crossing a cache line boundary. These penalties will happen by chance for more 8B objects on AMD hardware if crossing a 16B or 32B boundary really is non-atomic for regular loads/stores, instead of only having a penalty at 64B boundaries.

004000e0 <_start.loop>:
  4000e0:  f0 48 83 47 3f 01    lock add QWORD PTR [rdi+0x3f],0x1    ## rdi is page-aligned
  4000e6:  f0 48 83 47 7f 01    lock add QWORD PTR [rdi+0x7f],0x1
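For user code that can't wait for an ABI fix, a possible source-level workaround (my suggestion, not something from this report) is to over-align the _Atomic member explicitly so the 8-byte object can't straddle a cache line:

#include <stdatomic.h>

struct AtomicStructFixed {
    char filler[57];
    _Alignas(8) atomic_llong x;   // natural alignment keeps plain 8B loads/stores atomic on x86
};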
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #7 from Peter Cordes ---
C++11 std::atomic<> is correct, and the change was necessary. 8B alignment is required for 8B objects to be efficiently lock-free (using SSE load / store for .load() and .store(), see https://stackoverflow.com/questions/36624881/why-is-integer-assignment-on-a-naturally-aligned-variable-atomic), and to avoid a factor of ~100 slowdown if lock cmpxchg8b is split across a cache-line boundary.

What needs to change is the C11 stdatomic default alignment for 64-bit objects in and out of structs. (This includes _Atomic struct {int a,b;};) Currently, atomic_llong **is not atomic** in gcc, only in g++. I attached a testcase showing tearing to the still-unfixed C11 bug (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146#c4). (It was filed at the same time as the C++11 bug that led to the change in std::atomic.)

re: long double: it can't be lock-free in -m32. 10-byte x87 load / store instructions are not guaranteed to be atomic, and in fact even on real Intel CPUs are done as two separate load or store uops. alignof(long double) in 32-bit is different from alignof(long double) in 64-bit. std::atomic<long double> or _Atomic long double should always have the same alignment as long double.
[Bug target/65146] alignment of _Atomic structure member is not correct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146

--- Comment #6 from Peter Cordes ---
My test-case on godbolt: https://godbolt.org/g/MmLycw. gcc8 snapshot still only has 4B alignment.

Fun fact: clang4.0 -m32 inlines lock cmpxchg8b for 8-byte atomic load/store. This is ironic, because it *does* align _Atomic 64-bit objects to 8 bytes, so it could safely use SSE loads/stores. It would work correctly if called from gcc-compiled code that passed it a misaligned atomic_llong *. But since gcc and clang don't agree on i386 SysV struct layout for _Atomic 64-bit members anyway, clang should really just start using movq for 64-bit atomic objects in 32-bit mode.
[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660

--- Comment #11 from Peter Cordes ---
(In reply to Thiago Macieira from comment #10)
> Actually, PR 65146 points out that the problem is not efficiency but
> correctness. An under-aligned type could cross a cacheline boundary and thus
> fail to be atomic in the first place.

As I pointed out there, you technically could solve the correctness problem by checking alignment and falling back to locking for objects where a plain 8B load or 8B store wouldn't be atomic. That's what I meant by "efficiently lock-free". And we're talking about *huge* inefficiencies here compared to always being able to inline an SSE load/store.

That would let you keep struct layouts the same, but it would still be an ABI change, since everything has to agree about which objects are lock-free and which aren't. Now that I think about it, all of my suggested fixes on PR 65146 are effectively ABI changes.

> Those structures were disasters waiting to happen.

Yes, exactly. Basically any existing binaries compiled with a gcc that allows under-aligned atomic objects are unsafe, so keeping compatibility with them is not important.
[Bug target/65146] alignment of _Atomic structure member is not correct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #8 from Peter Cordes --- BTW, all of my proposals are really ABI changes, even if struct layout stays the same. All code has to agree on which objects are lock-free or not, and whether they need to check alignment before using an SSE load instead of lock cmpxchg8b or something. It won't be safe to link with binaries compiled with old gcc that assumes a simple SSE load/store is atomic on an 8B atomic_llong*, if the new design can still pass it underaligned pointers. The existing ABI is broken. Some code may happen to not be affected, especially when running on Intel hardware (where only 64B boundaries matter, not 8B boundaries for x86 in general). Or because they only depend on atomic RMW being atomic, not pure load or pure store, so they just take the ~100x performance hit without losing correctness in cases where a boundary is crossed.
[Bug tree-optimization/82135] New: Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82135 Bug ID: 82135 Summary: Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else. Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

The code in this report is easiest to look at here: https://godbolt.org/g/DffP3J, with asm output.

When g++ inlines this (copied version of std::align from include/c++/memory), it fails to optimize to just rounding up to the next power of 2 when align=size=64 and space=1024, but ptr is variable. (If __ptr is also constant, it's fine.)

#include <cstddef>
#include <cstdint>

inline void*
libalign(size_t __align, size_t __size, void*& __ptr, size_t& __space) noexcept
{
    const auto __intptr = reinterpret_cast<uintptr_t>(__ptr);
    const auto __aligned = (__intptr - 1u + __align) & -__align;
    //if (__aligned < __size) __builtin_unreachable();
    const auto __diff = __aligned - __intptr;
    //if (__diff > __size) __builtin_unreachable();
    if ((__size + __diff) > __space)
        return (void*)123456;  //nullptr;   // non-zero constant is obvious in the asm
    else {
        __space -= __diff;
        return __ptr = reinterpret_cast<void*>(__aligned);
    }
}

void *libalign64(void *voidp) {
    std::size_t len = 1024;
    //if (voidp+len < voidp) __builtin_unreachable();   // doesn't help
    voidp = libalign(64, 64, voidp, len);
    return voidp;
}

g++ -O3 -std=c++14 -Wall -Wextra (trunk 8.0.0 20170906)

# x86-64.  Other targets do the same compare/cmov or branch
    leaq    63(%rdi), %rax
    andq    $-64, %rax
    movq    %rax, %rdx
    subq    %rdi, %rdx
    addq    $65, %rdx
    cmpq    $1025, %rdx
    movl    $123456, %edx
    cmovnb  %rdx, %rax
    ret

libalign64 gives exactly the same result as just rounding up to the next power of 2 (including wrapping around to zero with addresses very close to the top). But gcc doesn't spot this, I think getting confused about what can happen with unsigned wraparound.

char *roundup2(char *p) {
    auto t = (uintptr_t)p;
    t = (t+63) & -64;
    return (char*)t;
}

    leaq    63(%rdi), %rax
    andq    $-64, %rax
    ret

For easy testing, I made wrappers that call with a constant pointer, so I can test that it really does wrap around at exactly the same place as roundup2(). (It does: libalign64(-64) = -64, libalign64(-63) = 0.) So it can safely be compiled to 2 instructions on targets where unsigned integer wraparound works normally, without all that adding constants and comparing against constants.

static char* const test_constant = (char*)-63ULL;
char *test_roundup2() { return roundup2(test_constant); }
void *test_libalign() { return libalign64(test_constant); }

Uncommenting this line I added:

    if (__diff > __size) __builtin_unreachable();

lets it compile to just two instructions, but that condition isn't really always true. __diff will be huge when __aligned wraps around.

clang, icc, and msvc also fail to make this optimization. IDK if it's particularly useful in real life for anything other than abusing std::align as a simple round-up function.
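For context, a sketch of the equivalent use of the real std::align with the same parameters as libalign64 (align = size = 64, space = 1024); the wrapper name here is just for illustration:

#include <memory>
#include <cstddef>

void *round_up_in_buffer(void *p) {
    std::size_t space = 1024;
    // rounds p up to a 64-byte boundary if 64 bytes still fit in the remaining space,
    // otherwise returns nullptr; on success, p and space are updated in place
    return std::align(64, 64, p, space);
}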
[Bug target/80568] x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|--- |DUPLICATE --- Comment #3 from Peter Cordes --- Bug 78762 is asking for the same thing: disable at least load-splitting in -mtune=generic when -mavx2 is enabled. Or more generally, ISA-aware tune=generic. *** This bug has been marked as a duplicate of bug 78762 ***