[Bug target/78762] Regression: Splitting unaligned AVX loads also when AVX2 is enabled

2017-09-07 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #16

[Bug target/82136] New: x86: -mavx256-split-unaligned-load should try to fold other shuffles into the load/vinsertf128

2017-09-07 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* static const int aligned = 0

[Bug tree-optimization/82137] New: auto-vectorizing shuffles way too much to avoid duplicate work

2017-09-07 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- (same code as bug 82136, but with aligned pointers, and discussing the overall vectorization

[Bug target/82136] x86: -mavx256-split-unaligned-load should try to fold other shuffles into the load/vinsertf128

2017-09-07 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136 --- Comment #1 from Peter Cordes --- Whoops, the compiler-explorer link had aligned=1. This one produces the asm I showed in the original report: https://godbolt.org/g/WsZ5S9 See bug 82137 for a much more efficient vectorization strategy. gcc sh

[Bug target/82139] New: unnecessary movapd with _mm_castsi128_pd to use BLENDPD on __m128i results

2017-09-07 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include #include // stripped down from a real

[Bug tree-optimization/82142] New: struct zeroing should use wide stores instead of avoiding overwriting padding

2017-09-08 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- (not sure if tree-optimization is the right "component", pl

[Bug tree-optimization/82135] Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else.

2017-09-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82135 --- Comment #2 from Peter Cordes --- (In reply to Marc Glisse from comment #1) > This PR is a bit messy, please minimize your examples... Sorry, looking at it again later I could have done better. I thought it was somewhat relevant that this wa

[Bug target/67458] x86: atomic store with memory_order_release doesn't order other stores

2017-09-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458 --- Comment #5 from Peter Cordes --- > optabs: ensure atomic_load/stores have compiler barriers Thanks for taking a look at this report. But I think it's not necessary to have a full 2-way barrier. If there's a lighter-weight way to get the be

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #13 from Peter Cordes --- (In reply to Thiago Macieira from comment #12) > Another problem is that we've now had a couple of years with this issue, so > it's probably worse to make a change again. A change to C++11 std::atomic? Yeah

[Bug target/82158] New: _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-08 Thread peter at cordes dot ca
https://godbolt.org/g/GhW4b8 Status: UNCONFIRMED Keywords: wrong-code Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 --- Comment #1 from Peter Cordes --- Related: bug 55747 describes why gcc keeps the `push {r4, lr}` in the _Noreturn function: backtraces.

[Bug tree-optimization/82137] auto-vectorizing shuffles way too much to avoid duplicate work

2017-09-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > Interesting idea. It's probably a bit hard to make the vectorizer do this > though given its current structure and the fact that it would have to > cost the ext

[Bug target/82136] x86: -mavx256-split-unaligned-load should try to fold other shuffles into the load/vinsertf128

2017-09-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136 --- Comment #3 from Peter Cordes --- (In reply to Richard Biener from comment #2) > And it gets worse because of the splitting > which isn't exposed to the vectorizer. Split loads/stores can be a useful shuffling strategy even on Haswell/Skylake

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 --- Comment #3 from Peter Cordes --- (In reply to jos...@codesourcery.com from comment #2) > Falling off a noreturn function sounds like it could be another case to > insert __builtin_trap (), as we do in various cases of undefined behavior. gc

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 --- Comment #5 from Peter Cordes --- (In reply to Ramana Radhakrishnan from comment #4) > It's a "feature" - if the function really doesn't return, then there is no > real requirement to save and restore all callee-saved registers. > > A deliber

[Bug target/82227] New: ARM thumb inefficient tailcall return sequence (multiple pops)

2017-09-16 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: arm*-*-* int ext(); int tailcall_external() { return ext(); } // https

[Bug target/71725] Backend decides to generate larger and possibly slower float ops for integer ops that appear in source

2017-09-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71725 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #1

[Bug target/82245] New: [x86] missed optimization: (int64_t) i32 << constant on 32-bit machines can combine shift + sign extension like on other arches

2017-09-18 Thread peter at cordes dot ca
Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

[Bug target/47769] [missed optimization] use of btr (bit test and reset)

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #6

[Bug target/82259] New: missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* bool bt_signed(int x, unsigned bit

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259 --- Comment #1 from Peter Cordes --- More generally, you can flip a higher bit while copying, e.g. with lea 64(%rdi), %eax. That leaves the bits above that position munged by carry-out, but that isn't always a problem.
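
A minimal sketch of the low-bit case from the bug title (my own illustration, not gcc output): the carry damage from the LEA is discarded by the AND that was needed anyway.

    # return !(x & 1), copying out of %rdi in one step:
    lea  1(%rdi), %eax    # x+1: bit 0 flipped, higher bits munged by carry
    and  $1, %eax         # the AND with 1 masks off the damaged bits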

[Bug target/82260] New: [x86] Unnecessary use of 8-bit registers with -Os: slightly slower and larger code

2017-09-19 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* int shift(int x, int c) { return

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259 --- Comment #3 from Peter Cordes --- Oops, BT sets CF, not ZF. So:
    bt    $13, %edi
    setnc %al    # aka setae
    ret
This is what clang does for the bt_ functions, and might be optimal for many use-cases. (For br

[Bug target/82261] New: x86: missing peephole for SHLD / SHRD

2017-09-19 Thread peter at cordes dot ca
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* unsigned shld(unsigned a, unsigned b, unsigned n){ //n=13; a <<= n; b >

[Bug target/82259] missed optimization: use LEA to add 1 to flip the low bit when copying before AND with 1

2017-09-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259 --- Comment #4 from Peter Cordes --- (In reply to Uroš Bizjak from comment #2) > A couple of *scc_bt patterns are missing. These are similar to already > existing *jcc_bt patterns. Combine wants: Does gcc also need patterns for bt + cmovcc? Thi

[Bug target/82267] New: x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

2017-09-19 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: ABI, missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-* x32 defaults to using 32-bit address

[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os: slightly slower and larger code

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260 --- Comment #4 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2) > From pure instruction size POV, for the first 2 alternatives as can be seen > say on: > ... > movb $0x15, %al > movl $0x15, %eax > movb $-0x78, %bl > movl $-0x78,

[Bug target/82260] [x86] Unnecessary use of 8-bit registers with -Os. slightly slower and larger code

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260 --- Comment #5 from Peter Cordes --- > (not (match_test "TARGET_PARTIAL_REG_STALL")) gcc is doing this even with -mtune=core2. Core2 / Nehalem stall (the front-end) for 2-3 cycles to insert a merging uop when reading a full register after w

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 Peter Cordes changed: Status: RESOLVED → VERIFIED --- Comment #7 from Peter Cordes

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 Peter Cordes changed: Status: VERIFIED → UNCONFIRMED; Resolution: WONTFIX cleared

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 --- Comment #10 from Peter Cordes --- (In reply to Jakub Jelinek from comment #9) > None of the above options is IMHO acceptable. > This is UB like any other. I still think it's a quality-of-implementation bug that could be fixed without downsid

[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #2

[Bug target/81602] Unnecessary zero-extension after 16 bit popcnt

2017-09-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602 --- Comment #3 from Peter Cordes --- Forgot to mention: memory-source popcnt with an indexed addressing mode would also be worse on SnB/IvB: it can't stay micro-fused, so the front-end un-laminates it in the issue stage. Haswell and later can ke

[Bug target/82281] New: Bulldozer/Zen tuning: uses XMM for single 64-bit integer AND, even with a simple mask

2017-09-21 Thread peter at cordes dot ca
Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- long long test_and(long long x) { return x & 0x77fff

[Bug target/82298] New: x86 BMI: no peephole for BZHI

2017-09-22 Thread peter at cordes dot ca
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* gcc never seems to emit BZHI on its own. // exact BZHI behaviour for all inputs (with no C UB) unsigned
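
The testcase is truncated above; a plausible reconstruction of such a function, written from BZHI's documented semantics (the name and body are my guesses, not the report's code):

    // exact BZHI behaviour for all inputs, with no C UB
    unsigned bzhi32(unsigned x, unsigned n) {
        n &= 0xff;                     // BZHI reads only bits 7:0 of the index
        if (n >= 32) return x;         // index >= operand size: source unchanged
        return x & ((1u << n) - 1);    // zero the bits from position n upward
    }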

[Bug target/82328] New: x86 rdrand: flags not used directly when branching on success/failure

2017-09-26 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include unsigned long long use_intrinsic(void
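
The testcase is truncated above; a sketch of the usual retry-loop shape (my reconstruction; needs -mrdrnd and a 64-bit target). The report's point: _rdrand64_step() returns RDRAND's CF success flag, so the compiler could branch on the flags directly instead of materializing the flag into an integer register first.

    #include <immintrin.h>

    unsigned long long use_intrinsic(void) {
        unsigned long long v;
        while (!_rdrand64_step(&v))   // retry until the hardware RNG reports success
            ;
        return v;
    }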

[Bug target/82158] _Noreturn functions that do return clobber caller's registers on ARM32 (but not other arches)

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158 Peter Cordes changed: Resolution: INVALID → WONTFIX --- Comment #11 from Peter Cordes

[Bug target/82267] x32: unnecessary address-size prefixes. Why isn't -maddress-mode=long the default?

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267 --- Comment #6 from Peter Cordes --- (In reply to H.J. Lu from comment #2) > > Are there still cases where -maddress-mode=long makes worse code? > > > Yes, there are more places where -maddress-mode=long needs to zero-extend > address to 64 bit

[Bug target/68924] No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.

2017-09-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924 --- Comment #1 from Peter Cordes --- There's __m128i _mm_loadl_epi64 (__m128i const* mem_addr)(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq&expand=5450,4247,3115&techs=SSE2), which gcc makes available in 32-bit mode.

[Bug target/82339] Inefficient movabs instruction

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339 --- Comment #4 from Peter Cordes --- (In reply to Jakub Jelinek from comment #0) > At least on i7-5960X in the following testcase: > > baz is fastest as well as shortest. > So I think we should consider using movl $cst, %edx; shlq $shift, %rdx >
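
An encoding sketch of the trade-off being quoted (constant and shift count chosen by me for illustration):

    movabsq $0x100000000000, %rdx   # 10 bytes: REX.W opcode + imm64
    # vs.
    movl    $0x10, %edx             # 5 bytes; writing %edx zeros the upper half
    shlq    $40, %rdx               # 4 bytes: 9 bytes total, but 2 uops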

[Bug target/82339] Inefficient movabs instruction

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339 --- Comment #5 from Peter Cordes --- (In reply to Richard Biener from comment #2) > I always wondered if it is more efficient to have constant pools per function > in .text so we can do %rip relative loads with short displacement? There's no rel

[Bug target/68924] No intrinsic for x86 `MOVQ m64, %xmm` in 32bit mode.

2017-09-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924 --- Comment #3 from Peter Cordes --- (In reply to Marc Glisse from comment #2) > Does anything bad happen if you remove the #ifdef/#endif for > _mm_cvtsi64_si128? (2 files in the testsuite would need updating for a > proper patch) It's just a wr

[Bug target/80568] New: x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.

2017-04-29 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Created attachment 41285 --> https://gcc.gnu.org/

[Bug target/80570] New: auto-vectorizing int->double conversion should use half-width memory operands to avoid shuffles, instead of load+extract

2017-04-29 Thread peter at cordes dot ca
Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Tar

[Bug target/80571] New: AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2017-04-30 Thread peter at cordes dot ca
Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Tar

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2017-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 --- Comment #1 from Peter Cordes --- Tracking "cold" registers that are safe to use as a read-only source (whether or not they're holding a useful value like a constant) has a couple other applications for x86: * vcvtsi2ss/sd %src,%merge_into,

[Bug target/80586] New: vsqrtss with AVX should avoid a dependency on the destination register.

2017-05-01 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include float sqrt_depcheck(float a, float b

[Bug target/80568] x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.

2017-05-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 --- Comment #2 from Peter Cordes --- Using ISA-extension options removes some microarchitectures from the set of CPUs that can run the code, so it would be appropriate for them to have some effect on tuning. A "generic AVX2 CPU" is much more spe

[Bug target/80636] New: AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-04 Thread peter at cordes dot ca
Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* Currently, gcc compiles _mm256_setzero_ps() to vxorps %ymm0, %ymm0, %ymm0, or zmm for _mm512_setzero_ps. And similar for pd and integer vectors, using a vector
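
The preferred form, for reference (my own illustration of the report's point):

    vxorps %xmm0, %xmm0, %xmm0    # VEX/EVEX ops zero-extend into the full
                                  # ymm0/zmm0, so this zeroes the whole register
                                  # with a shorter encoding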

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636 --- Comment #2 from Peter Cordes --- > The same possibly applies to all "zero-extending" moves? Yes, if a vmovdqa %xmm0,%xmm1 will work, it's the best choice on AMD CPUs, and doesn't hurt on Intel CPUs. So in any case where you need to copy a

[Bug target/80813] New: x86: std::vector::operator[] could be somewhat faster using BT instead of SHL

2017-05-17 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* This actually applies to all cases of testing

[Bug target/80819] New: [5/6/7/8 regression] Useless store to the stack in _mm_set_epi64x with SSE4 -mno-avx

2017-05-18 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include __m128i combine64(long long
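
The testcase is truncated above; its assumed shape (my reconstruction):

    #include <immintrin.h>

    __m128i combine64(long long a, long long b) {
        // ideal SSE4 -mno-avx code: two movq reg->xmm transfers plus a
        // punpcklqdq to merge, with no store/reload through the stack
        return _mm_set_epi64x(a, b);
    }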

[Bug target/80820] New: _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2017-05-18 Thread peter at cordes dot ca
Product: gcc Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- T

[Bug target/80819] [5/6/7/8 regression] Useless store to the stack in _mm_set_epi64x with SSE4 -mno-avx

2017-05-18 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819 --- Comment #2 from Peter Cordes --- (In reply to Andrew Pinski from comment #1) > >-mtune=generic still stores/reloads instead of using movd for %edi and %edx, > >which is worse for most CPUs. > Worse on most Intel but not most AMD CPUs. You

[Bug target/80833] New: 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-19 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i?86-*-* This affects 64-bit atomic loads/stores, as well as _mm_set_epi

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #1 from Peter Cordes --- See https://godbolt.org/g/krXH9M for the functions I was looking at.

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #2 from Peter Cordes --- See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833. gcc -m32 does an even worse job of getting int64_t into an xmm reg, e.g. as part of a 64-bit atomic store. We get a store-forwarding failure from

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #3 from Peter Cordes --- Also, going the other direction is not symmetric. On some CPUs, a store/reload strategy for xmm->int might be better even if an ALU strategy for int->xmm is best. Also, the choice can depend on chunk size, s

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #2 from Peter Cordes --- On most CPUs, psrldq / movd is optimal for xmm[1] -> int without SSE4. On SnB-family, movd runs on port0, and psrldq can run on port5, so they can execute in parallel. (And the second movd can run the next c
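
A sketch of the sequence being described, extracting element 1 (my own listing):

    psrldq $4, %xmm0      # byte-shift element 1 down to element 0 (port 5 on SnB)
    movd   %xmm0, %eax    # element 1 as an integer (port 0, overlaps other movd)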

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #3 from Peter Cordes --- Atom's movd xmm->int is slower (lat=4, rtput=2) than its movd int->xmm (lat=3, rtput=1), which is opposite of every other CPU (except Silvermont where they're the same throughput but xmm->int is 1c slower). S

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-19 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #4 from Peter Cordes --- I don't think it's worth anyone's time to implement this in 2017, but using MMX regs for 64-bit store/load would be faster on really old CPUs that split 128b vectors insns into two halves, like K8 and Pentium

[Bug libstdc++/80835] New: Reading a member of an atomic can load just that member, not the whole struct

2017-05-20 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- For std::atomic or similar small struct, accessing just one member with foo.load().m

[Bug target/80837] New: [7.1.0 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-05-20 Thread peter at cordes dot ca
Version: 7.1.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target

[Bug target/80636] AVX / AVX512 register-zeroing should always use AVX 128b, not ymm or zmm

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636 --- Comment #3 from Peter Cordes --- The point about moves also applies to integer code, since a 64-bit mov requires an extra byte for the REX prefix (unless a REX prefix was already required for r8-r15). I just noticed a case where gcc uses a 6

[Bug target/70490] __atomic_load_n(const __int128 *, ...) generates CMPXCHG16B with no warning

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #5

[Bug libstdc++/80835] Reading a member of an atomic can load just that member, not the whole struct

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 Peter Cordes changed: See Also: added https://gcc.gnu.org/bugzill

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-05-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 Peter Cordes changed: CC: added peter at cordes dot ca --- Comment #5

[Bug target/80844] New: OpenMP SIMD doesn't know how to efficiently zero a vector (it stores zeros and reloads)

2017-05-20 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- float sumfloat_omp(const float arr[]) { float sum=0; #pragma omp
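
The testcase is truncated above; its assumed shape (the trip count is my guess; compile with -fopenmp-simd):

    float sumfloat_omp(const float arr[]) {
        float sum = 0;
    #pragma omp simd reduction(+:sum)
        for (int i = 0; i < 1024; i++)
            sum += arr[i];
        // the reduction vector should start life as a register pxor/vxorps,
        // not be built by storing zeros to the stack and reloading them
        return sum;
    }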

[Bug target/80846] New: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-20 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i
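
For reference, the narrow-first strategy the report argues for, as intrinsics (my own sketch, not code from the report):

    #include <immintrin.h>

    float hsum256_ps(__m256 v) {
        __m128 lo = _mm256_castps256_ps128(v);      // free: reuse the xmm half
        __m128 hi = _mm256_extractf128_ps(v, 1);
        __m128 s  = _mm_add_ps(lo, hi);             // narrow to 128b right away
        s = _mm_add_ps(s, _mm_movehl_ps(s, s));     // add the high pair down
        s = _mm_add_ss(s, _mm_movehdup_ps(s));      // add element 1 into element 0
        return _mm_cvtss_f32(s);
    }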

[Bug libstdc++/80835] Reading a member of an atomic can load just that member, not the whole struct

2017-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 --- Comment #3 from Peter Cordes --- (In reply to Jonathan Wakely from comment #2) > You've reported this against libstdc++ I had to take a guess at the right component, based on a couple other std::atomic bugs I looked at. Apparently I picked

[Bug c++/80835] Reading a member of an atomic can load just that member, not the whole struct

2017-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 --- Comment #4 from Peter Cordes --- Thanks for correcting my mistake in tagging this bug, but this got me thinking it's not just a C++ issue. This also applies to GNU C __atomic_load_n(), and ISO C11 stdatomic code like #include #include uin
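
A hedged C11 sketch of the pattern (struct layout and names are my own):

    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct { uint32_t lo, hi; } Pair;

    uint32_t get_lo(_Atomic Pair *p) {
        // this loads the whole 8-byte struct and throws half away; a plain
        // 4-byte load of just .lo would still be a valid atomic read of it
        return atomic_load_explicit(p, memory_order_relaxed).lo;
    }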

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2017-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #6 from Peter Cordes --- (In reply to Richard Biener from comment #5) > There's some related bugs. I think there is no part of the compiler that > specifically tries to avoid store forwarding issues. Ideally the compiler would keep

[Bug c++/80835] Reading a member of an atomic can load just that member, not the whole struct

2017-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835 --- Comment #5 from Peter Cordes --- Previous godbolt link was supposed to be: https://godbolt.org/g/78kIAl which includes the CAS functions.

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-05-23 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > GCC 8 generates a __atomic_load_16 call for me while GCC 6 does > > lock cmpxchg16b (%rdi) That's expected. See https://gcc.gnu.org/ml/gcc-patches/2017

[Bug tree-optimization/80844] OpenMP SIMD doesn't know how to efficiently zero a vector (it stores zeros and reloads)

2017-05-23 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844 --- Comment #3 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2) > It doesn't always zero, it can be pretty arbitrary. Is it feasible to have it just load the first vector of elements, instead of broadcasting the identity value? i.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2017-05-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > That is, it was supposed to end up using pslldq I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that makes bit/byte shift directions al

[Bug target/78855] New: -mtune=generic should keep cmp/jcc together. AMD and Intel both macro-fuse

2016-12-18 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86-64-*-* -mtune=generic and -mtune=intel currently don't opt

[Bug tree-optimization/78947] New: sub-optimal code for (bool)(int ? int : int)

2016-12-28 Thread peter at cordes dot ca
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* Full version: http://stackoverflow.com/questions/41323911/why-the-difference-in-code

[Bug tree-optimization/82356] New: auto-vectorizing pack of 16->8 has a redundant AND after a shift

2017-09-28 Thread peter at cordes dot ca
Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include void pack_high8_basel
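
The testcase is truncated above; its assumed shape (my reconstruction):

    #include <stdint.h>
    #include <stddef.h>

    void pack_high8_baseline(uint8_t *restrict dst,
                             const uint16_t *restrict src, size_t n) {
        for (size_t i = 0; i < n; i++)
            dst[i] = src[i] >> 8;   // keep the high byte of each uint16_t
        // after psrlw $8 every element is already <= 0xFF, so the extra
        // pand the vectorizer emits before packuswb is redundant
    }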

[Bug target/82369] New: "optimizes" indexed addressing back into two pointer increments

2017-09-29 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* gcc defeats this attempt to get it to reduce the

[Bug target/82370] New: AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-09-29 Thread peter at cordes dot ca
Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include #include #include

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370 --- Comment #2 from Peter Cordes --- (In reply to Jakub Jelinek from comment #1) > Created attachment 42296 [details] > gcc8-pr82370.patch > > If VPAND is exactly as fast as VPANDQ except for different encodings, then > maybe we can do something

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370 --- Comment #3 from Peter Cordes --- Doesn't change the performance implications, but I just realized I have the offset-load backwards. Instead of
    vpsrlw $8, (%rsi), %xmm1
    vpand  15(%rsi), %xmm2, %xmm0
this algorithm should us

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370 --- Comment #4 from Peter Cordes --- VPANDQ can be shorter than an equivalent VPAND, for displacements > 127 but <= 16 * 127 or 32 * 127, and that are an exact multiple of the vector width. EVEX with disp8 always implies a compressed displacemen

[Bug tree-optimization/82432] New: Missed constant propagation of return values of non-inlined static functions

2017-10-04 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- static __attribute((noinline)) int get_constant() { /* optionally stuff
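
A hypothetical completion of the pattern (the return value and caller are my own, not the report's):

    static __attribute((noinline)) int get_constant() {
        /* optionally stuff with side effects */
        return 42;
    }

    int caller(void) {
        // every definition of this static function returns 42, so IPA could
        // compile this to a call (kept for side effects) plus mov $43, %eax
        return get_constant() + 1;
    }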

[Bug tree-optimization/82432] Missed constant propagation of return values of non-inlined static functions

2017-10-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82432 --- Comment #1 from Peter Cordes --- Meant to add https://godbolt.org/g/K9CxQ6 before submitting. And to say I wasn't sure tree-optimization was the right component. I did check that -flto didn't do this optimization either. Is it worth openin

[Bug target/82370] AVX512 can use a memory operand for immediate-count vpsrlw, but gcc doesn't.

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370 --- Comment #5 from Peter Cordes --- I got off topic with this bug. It was supposed to be about emitting
    vpsrlw $8, (%rsi), %xmm1    # load folded into AVX512BW version
instead of
    vmovdqu64 (%rsi), %xmm0     # or VEX vmovdqu;

[Bug target/82459] New: AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-10-06 Thread peter at cordes dot ca
Version: 8.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* gcc bottlenecks on

[Bug target/82460] New: AVX512: choose between vpermi2d and vpermt2d to save mov instructions. Also, fails to optimize away shifts before shuffle

2017-10-06 Thread peter at cordes dot ca
Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2017-10-06 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #1 from Peter Cordes --- BTW, if we *are* using vpmovwb, it supports a memory operand. It doesn't save any front-end uops on Skylake-avx512, just code-size. Unless it means less efficient packing in the uop cache (since all uops fro

[Bug target/82582] New: not quite optimal code for -2*x*y - 3*z: could use one less LEA for smaller code without increasing critical path latency for any input

2017-10-17 Thread peter at cordes dot ca
Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone

[Bug tree-optimization/82666] New: [7/8 regression]: sum += (x>128 ? x : 0) puts the cmov on the critical path (at -O2)

2017-10-22 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* long long sumarray(const
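
The testcase is truncated above; its assumed shape (my reconstruction):

    long long sumarray(const int *data, int n) {
        long long sum = 0;
        for (int i = 0; i < n; i++)
            sum += (data[i] > 128 ? data[i] : 0);
        // cmov-selecting between sum and sum+x puts the cmov on the
        // loop-carried dependency chain; selecting between x and 0 first,
        // then adding, keeps it off the critical path
        return sum;
    }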

[Bug target/82667] New: SSE2 redundant pcmpgtd for sign-extension of values known to be >= 0

2017-10-22 Thread peter at cordes dot ca
Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- long long sumarray(const int *data) { data = (const int*)__builtin_assume_alig

[Bug target/82668] New: could use BMI2 rorx for unpacking struct { int a,b }; from a register (SysV ABI)

2017-10-22 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-* struct twoint { int a, b; }; int bar(struct
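
The testcase is truncated above; a plausible completion (the body is my guess):

    struct twoint { int a, b; };

    int bar(struct twoint s) {   // SysV packs the struct into one register:
        return s.b;              // b lives in the high 32 bits of %rdi
    }
    // with BMI2:  rorx $32, %rdi, %rax   (one uop, any dest, flags untouched)
    // instead of: movq %rdi, %rax ; shrq $32, %rax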

[Bug target/82680] Use cmpXXss and cmpXXsd for setcc boolean compare

2017-10-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82680 --- Comment #2 from Peter Cordes --- gcc's sequence is *probably* good, as long as it uses xor / comisd / setcc and not comisd / setcc / movzx (which gcc often likes to do for integer setcc). (u)comisd and cmpeqsd both run on the FP add unit. A
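
The two sequences being compared, for reference (my own listing):

    xorl    %eax, %eax       # zero the destination before the compare
    comisd  %xmm1, %xmm0     # (xor must come first: it clobbers FLAGS)
    seta    %al
    # vs. gcc's usual integer-setcc shape, with movzx on the critical path:
    comisd  %xmm1, %xmm0
    seta    %al
    movzbl  %al, %eax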

[Bug tree-optimization/82729] New: adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-*

[Bug target/82730] New: extra store/reload of an XMM for every byte extracted

2017-10-26 Thread peter at cordes dot ca
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include #include #include void p128_as_u8hex(__m128i in) { _Alignas(16

[Bug target/82731] New: _mm256_set_epi8(array[offset[0]], array[offset[1]], ...) byte gather makes slow code, trying to zero-extend all the uint16_t offsets first and spilling them.

2017-10-26 Thread peter at cordes dot ca
Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* #include "immintrin.h" #include "inttypes.h" __m256i gather(char *array, uint16_t *offset) { return _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]], array[offset[3]], arr

[Bug tree-optimization/82732] New: malloc+zeroing other than memset not optimized to calloc, so asm output is malloc+memset

2017-10-26 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- #include #include int *foo(unsigned size) { int *p

[Bug rtl-optimization/82729] adjacent small objects can be initialized with a single store (but aren't for char a[] = "a")

2017-10-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729 --- Comment #2 from Peter Cordes --- (In reply to Richard Biener from comment #1) > The issue is we have no merging of stores at the RTL level and the GIMPLE > level doesn't know whether the variables will end up allocated next to each > other.
