[Bug c++/35669] NULL (__null) not considered different from 0 with C++

2009-02-12 Thread peter at cordes dot ca
--- Comment #8 from peter at cordes dot ca 2009-02-12 17:56 --- Would it cause any problems for g++ to behave more like a C compiler when it comes to NULL? e.g. I found this bug report after finding that kscope 1.9.1 didn't compile, because it expected NULL to match the void* ve

[Bug target/39942] Nonoptimal code - leaveq; xchg %ax,%ax; retq

2020-04-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #53

[Bug tree-optimization/92243] New: Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
Version: 10.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: ---

[Bug tree-optimization/92243] Missing "auto-vectorization" of char array reversal using x86 scalar bswap when SIMD pshufb isn't available

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243 --- Comment #1 from Peter Cordes --- Forgot to mention, this probably applies to other ISAs with GP-integer byte-reverse instructions and efficient unaligned loads.
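
For readers unfamiliar with the idea in bug 92243, here is a minimal C sketch of reversing a byte array with a general-purpose byte-swap instead of a SIMD shuffle. The function name `reverse_bytes_bswap` is hypothetical and not taken from the report. `__builtin_bswap64` is a GCC/Clang builtin, and `memcpy` stands in for the unaligned loads and stores. It illustrates the technique only; it is not the code GCC emits or the report's test case.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reverse buf[0..len) in place (hypothetical helper, for illustration). */
void reverse_bytes_bswap(unsigned char *buf, size_t len)
{
    size_t lo = 0, hi = len;

    /* While two non-overlapping 8-byte chunks remain, byte-reverse each
       chunk and store it at the opposite end of the array. */
    while (hi - lo >= 16) {
        uint64_t front, back;
        memcpy(&front, buf + lo, 8);       /* unaligned load  */
        memcpy(&back, buf + hi - 8, 8);
        front = __builtin_bswap64(front);  /* reverse the 8 bytes */
        back = __builtin_bswap64(back);
        memcpy(buf + lo, &back, 8);        /* unaligned store */
        memcpy(buf + hi - 8, &front, 8);
        lo += 8;
        hi -= 8;
    }

    /* Scalar cleanup for the middle (fewer than 16 bytes left). */
    while (hi - lo >= 2) {
        unsigned char tmp = buf[lo];
        buf[lo] = buf[hi - 1];
        buf[hi - 1] = tmp;
        lo++;
        hi--;
    }
}
```

Because a byte swap of the loaded value reverses the bytes as they appear in memory, the sketch does not depend on the host's endianness.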

[Bug tree-optimization/92244] New: extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- We get a redundant instruction inside the vectorized loop here. But it's

[Bug tree-optimization/92244] extra sub inside vectorized loop instead of calculating end-pointer

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #1 from Peter Cordes --- On AArch64 (with gcc8.2), we see a similar effect, more instructions in the loop. And an indexed addressing mode. https://godbolt.org/z/6ZVWY_ # strrev_explicit -O3 -mcpu=cortex-a53 ... .L4:

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 Peter Cordes changed: What|Removed |Added Summary|extra sub inside vectorized |vectorized loop updating 2

[Bug target/92246] New: Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* typedef short

[Bug target/92246] Byte or short array reverse loop auto-vectorized with 3-uop vpermt2w instead of 1 or 2-uop vpermw (AVX512)

2019-10-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246 --- Comment #1 from Peter Cordes --- And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This problem only applies to char and short. Possibly because AVX2 includes vpermd ymm. Apparently CannonLake has 1 uop vpermb bu

[Bug tree-optimization/92244] vectorized loop updating 2 copies of the same pointer (for in-place reversal cross in the middle)

2019-10-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244 --- Comment #4 from Peter Cordes --- (In reply to Andrew Pinski from comment #3) > (In reply to Peter Cordes from comment #1) > > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the > > loop. And an indexed addressing mod

[Bug target/82459] AVX512BW instruction costs: vpmovwb is 2 uops on Skylake and not always worth using vs. vpack + vpermq lane-crossing fixup

2019-10-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 Peter Cordes changed: What|Removed |Added See Also||https://gcc.gnu.org/bugzill

[Bug target/89346] Unnecessary EVEX encoding

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/40838] gcc shouldn't assume that the stack is aligned

2019-10-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #91

[Bug target/93141] Missed optimization : Use of adc when checking overflow

2020-01-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2

[Bug target/89063] [x86] lack of support for BEXTR from BMI extension

2019-01-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/89071] New: AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double

2019-01-26 Thread peter at cordes dot ca
ywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- float cvt(double unused, double xmm1) { return xmm1; } g++ (GCC-Explorer-Build)

[Bug target/80586] vsqrtss with AVX should avoid a dependency on the destination register.

2019-01-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #2 from Peter Cordes --- (In reply to H.J. Lu from comment #1) > But > > vxorps %xmm0, %xmm0, %xmm0 > vcvtsd2ss %xmm1, %xmm0, %xmm0 > > are faster than both. On Skylake-client (i7-6700k), I can't reproduce this r

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #3 from Peter Cordes --- (In reply to H.J. Lu from comment #1) I have a patch for PR 87007: > > https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html > > which inserts a vxorps at the last possible position. vxorps > will be exe

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #5 from Peter Cordes --- (In reply to H.J. Lu from comment #4) > (In reply to Peter Cordes from comment #2) > > Can you show some > > asm where this performs better? > > Please try cvtsd2ss branch at: > > https://github.com/hjl-to

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #6 from Peter Cordes --- (In reply to Peter Cordes from comment #5) > But whatever the effect is, it's totally unrelated to what you were *trying* > to test. :/ After adding a `ret` to each AVX function, all 5 are basically the same

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-28 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #8 from Peter Cordes --- Created attachment 45544 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit testloop-cvtss2sd.asm (In reply to H.J. Lu from comment #7) > I fixed assembly codes and run it on different AVX

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-01-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #10 from Peter Cordes --- (In reply to Uroš Bizjak from comment #9) > There was similar patch for sqrt [1], I think that the approach is > straightforward, and could be applied to other reg->reg scalar insns as > well, independently o

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #4 from Peter Cordes --- I suspect dep-chains are the problem, and branching to skip work is a Good Thing when it's predictable. (In reply to Richard Biener from comment #2) > On Skylake it's better (1uop, 1 cycle latency) while on R

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #5 from Peter Cordes --- IF ( xij.GT.+HALf ) xij = xij - PBCx IF ( xij.LT.-HALf ) xij = xij + PBCx For code like this, *if we can prove only one of the IF() conditions will be true*, we can implement it

[Bug target/88494] [9 Regression] polyhedron 10% mdbx runtime regression

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494 --- Comment #6 from Peter Cordes --- Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization suggestion for doing both compares in one masked SUB of +-PBCx applies equally. And I think my testing with VBLENDVPS should apply equa

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #15 from Peter Cordes --- (In reply to Uroš Bizjak from comment #13) > I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP > and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we > currently

[Bug target/85366] New: Failure to use both div and mod results of one IDIV in a prime-factor loop while(n%i==0) { n/=i; }

2018-04-11 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* From https
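
As a rough illustration of the loop shape named in the title of bug 85366, the sketch below writes the quotient and remainder explicitly, so that a single division could in principle supply both. The function name `strip_factor` and the guard against `i < 2` are assumptions added here for a self-contained example; whether a given GCC version actually uses one `div` for both results is exactly what the report is about, and is not claimed here.

```c
/* Divide *n by i for as long as i divides it evenly.
   Returns how many times i was removed (hypothetical helper). */
unsigned strip_factor(unsigned *n, unsigned i)
{
    unsigned count = 0;

    /* Guard added for the sketch: i < 2 or *n == 0 would never terminate. */
    if (i < 2 || *n == 0)
        return 0;

    for (;;) {
        unsigned q = *n / i;   /* quotient and remainder of the same */
        unsigned r = *n % i;   /* operands: one x86 DIV produces both */
        if (r != 0)
            break;
        *n = q;                /* reuse the quotient instead of dividing again */
        count++;
    }
    return count;
}
```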

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-15 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug c++/69560] x86_64: alignof(uint64_t) produces incorrect results with -m32

2018-04-26 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #23

[Bug target/81274] x86 optimizer emits unnecessary LEA instruction when using AVX intrinsics

2018-04-30 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274 --- Comment #2 from Peter Cordes --- The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least for this testcase. Gcc's stack-alignment strategy seems to be improved overall (not copying the return address when not needed),

[Bug tree-optimization/85585] New: switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a st

2018-05-01 Thread peter at cordes dot ca
Reporter: peter at cordes dot ca Target Milestone: --- Bug 84011 shows some really silly code-gen for PIC code and discussion suggested using a table of offsets instead of a table of actual pointers, so you just need one base address. A further optimization is possible when the strings are

[Bug tree-optimization/85585] switch to select a string based on an enum can profitably optimize away the table of pointers/offsets into fixed-length char[] blocks. Or use byte offsets into a string

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585 --- Comment #1 from Peter Cordes --- By comparison, the no-PIE table of pointers only needs one instruction: movq CSWTCH.4(,%rdi,8), %rax So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a massive savings

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #9

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #12 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > (In reply to Peter Cordes from comment #9) > > gcc already totally misses optimizations here where one string is a suffix > > of another. "mii" could just be a

[Bug tree-optimization/84011] Optimize switch table with run-time relocation

2018-05-01 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011 --- Comment #13 from Peter Cordes --- (In reply to Jakub Jelinek from comment #10) > ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling. > Why should gcc duplicate that? Because gcc would benefit from knowing if merging makes the

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2018-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #5 from Peter Cordes --- Update: https://godbolt.org/g/ZQDY1G gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't. void rangecheck_var(int64_t x, int64_t lim2) { //lim2 >>= 60; lim2 &= 0xf; // let the compiler figure

[Bug target/80833] 32-bit x86 causes store-forwarding stalls for int64_t -> xmm

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833 --- Comment #14 from Peter Cordes --- I happened to look at this old bug again recently. re: extracting high the low two 32-bit elements: (In reply to Uroš Bizjak from comment #11) > > Or without SSE4 -mtune=sandybridge (anything that excluded

[Bug target/80820] _mm_set_epi64x shouldn't store/reload for -mtune=haswell, Zen should avoid store/reload, and generic should think about it.

2018-06-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820 --- Comment #5 from Peter Cordes --- AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop replacement for vpinsrq/d, which is 2 uops on Intel/AMD. See my answer on https://stackoverflow.com/questions/50779309/loading-an

[Bug rtl-optimization/86352] New: setc/movzx introduced into loop to provide a constant 0 value for a later rep stos

2018-06-28 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* The wrong-code bug 86314 also

[Bug tree-optimization/91026] switch expansion produces a jump table with trivial entries

2019-07-29 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3

[Bug c/91398] Possible missed optimization: Can a pointer be passed as hidden pointer in x86-64 System V ABI

2019-08-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug middle-end/91515] missed optimization: no tailcall for types of class MEMORY

2019-08-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug tree-optimization/92080] New: Missed CSE of _mm512_set1_epi8(c) with _mm256_set1_epi8(c)

2019-10-13 Thread peter at cordes dot ca
Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* As a workaround for PR 82887 some code (e.g. a memset

[Bug target/82887] ICE: in extract_insn, at recog.c:2287 (unrecognizable insn) with _mm512_extracti64x4_epi64

2019-10-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887 --- Comment #5 from Peter Cordes --- Reported bug 92080 for the missed CSE

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #5 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > Can't reproduce. It is true that we now emit the __atomic_load_16 call, but > that was intentional change Yup. >, and it can't be easily tail call, because the

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-12-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #6 from Peter Cordes --- (In reply to Jakub Jelinek from comment #4) > But have just tried gcc 7.1.0 release and can't reproduce even there. Matt says the Compiler Explorer backend uses upstream release tarballs like `URL=ftp://ftp.g

[Bug tree-optimization/53947] [meta-bug] vectorizer missed-optimizations

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947 Bug 53947 depends on bug 80846, which changed state. Bug 80846 Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 Peter Cordes changed: What|Removed |Added Status|RESOLVED|REOPENED Resolution|FIXED

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #22 from Peter Cordes --- Forgot the Godbolt link with updated cmdline options: https://godbolt.org/g/FCZAEj.

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-14 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #25 from Peter Cordes --- We're getting a spill/reload inside the loop with AVX512: .L2: vmovdqa64 (%esp), %zmm3 vpaddd (%eax), %zmm3, %zmm2 addl $64, %eax vmovdqa64 %zmm2, (%esp)

[Bug target/80846] auto-vectorized AVX2 horizontal sum should narrow to 128b right away, to be more efficient for Ryzen and Intel

2018-01-16 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846 --- Comment #28 from Peter Cordes --- (In reply to Richard Biener from comment #27) > Note that this is deliberately left as-is because the target advertises > (cheap) support for horizontal reduction. The vectorizer simply generates > a single

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #3

[Bug target/38959] Additional switches to disallow processor supplementary instructions

2019-02-12 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959 --- Comment #4 from Peter Cordes --- The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous comment is already reported and fixed (in gcc9 only): bug 87550 It was present since at least gcc 5.0 https://software.intel.com/e

[Bug target/80571] AVX allows multiple vcvtsi2ss/sd (integer -> float/double) to reuse a single dep-breaking vxorps, even hoisting it out of loops

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571 --- Comment #2 from Peter Cordes --- I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for AVX. If register pressure is an issue, using a reg holding an arbitrary constant (instead of xor-zeroed) is a valid option, as th

[Bug target/89071] AVX vcvtsd2ss lets us avoid PXOR dependency breaking for scalar float<->double and other scalar xmm,xmm instructions

2019-02-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071 --- Comment #22 from Peter Cordes --- Nice, that's exactly the kind of thing I suggested in bug 80571. If this covers * vsqrtss/sd (mem),%merge_into, %xmm * vpcmpeqd %same,%same, %dest # false dep on KNL / Silvermont * vcmptrueps %sam

[Bug target/88809] do not use rep-scasb for inline strlen/memchr

2019-04-09 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug target/90568] New: stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* cmp/jne is always at least as efficient as xor

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-21 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #1 from Peter Cordes --- https://godbolt.org/z/hHCVTc Forgot to mention, stack-protector also disables use of the red-zone for no apparent reason, so that's another missed optimization. (Perhaps rarely relevant; probably most functi

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #3 from Peter Cordes --- (In reply to Jakub Jelinek from comment #2) > The xor there is intentional, for security reasons we do not want the stack > canary to stay in the register afterwards, because then it could be later > spilled o

[Bug target/90568] stack protector should use cmp or sub, not xor, to allow macro-fusion on x86

2019-05-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568 --- Comment #5 from Peter Cordes --- And BTW, this only helps if the SUB and JNE are consecutive, which GCC (correctly) doesn't currently optimize for with XOR. If this sub/jne is different from a normal sub/branch and won't already get optimize

[Bug target/90582] New: AArch64 stack-protector wastes an instruction on address-generation

2019-05-22 Thread peter at cordes dot ca
-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- void protect_me() { volatile int buf[2]; buf[1] = 3; } https://godbolt.org/z/xdlr5w

[Bug target/91103] New: AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-06 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* GCC9.1 and current trunk

[Bug target/91103] AVX512 vector element extract uses more than 1 shuffle instruction; VALIGND can grab any element

2019-07-08 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103 --- Comment #4 from Peter Cordes --- We should not put any stock in what ICC does for GNU C native vector indexing. I think it doesn't know how to optimize that because it *always* spills/reloads even for `vec[0]` which could be a no-op. And it

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-07-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #3 from Peter Cordes --- I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen for that choice, it will make it a win in more cases

[Bug target/82459] AVX512F instruction costs: vmovdqu8 stores may be an extra uop, and vpmovwb is 2 uops on Skylake and not always worth using

2018-07-31 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459 --- Comment #4 from Peter Cordes --- The VPAND instructions in the 256-bit version are a missed-optimization. I had another look at this with current trunk. Code-gen is similar to before with -march=skylake-avx512 -mprefer-vector-width=512. (I

[Bug libstdc++/71660] [6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2018-03-13 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #17 from Peter Cordes --- (In reply to Jonathan Wakely from comment #16) > But what we do care about is comment 2, i.e. _Atomic(T) and std::atomic > should have the same alignment (both in and out of structs). Maybe that needs > the C

[Bug target/85038] New: x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Bug 82267 was fixed for RSP only. (Or interpreted narrowly as only being

[Bug target/85038] x32: unnecessary address-size prefix when a pointer register is already zero-extended

2018-03-22 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038 --- Comment #1 from Peter Cordes --- Correction for AArch64: it supports addressing modes with a 64-bit base register + 32-bit index register with zero or sign extension for the 32-bit index. But not 32-bit base registers. As a hack that's bett

[Bug target/69576] New: tailcall could use a conditional branch on x86, but doesn't

2016-01-31 Thread peter at cordes dot ca
issed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-*, x86_64-* In x86, both jmp and jcc can use either a rel8 or

[Bug rtl-optimization/69615] New: 0 to limit signed range checks don't always use unsigned compare

2016-02-01 Thread peter at cordes dot ca
issed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- gcc sometimes misses the unsigned-compare trick for checking if a signed val
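
For context, the "unsigned-compare trick" named in the title of bug 69615 can be sketched in C as follows. Both function names are hypothetical. The two forms agree only under the assumption that `limit` is non-negative, which is stated here as a precondition rather than checked.

```c
#include <stdbool.h>

/* Straightforward form: two signed comparisons. */
bool in_range_two_compares(int x, int limit)
{
    return x >= 0 && x < limit;
}

/* Single-compare form. Assumes limit >= 0: a negative x converts to a
   large unsigned value, so it fails the one unsigned comparison. */
bool in_range_unsigned(int x, int limit)
{
    return (unsigned)x < (unsigned)limit;
}
```

If `limit` could be negative the two functions differ, since the first always returns false while the second may not.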

[Bug tree-optimization/69615] 0 to limit signed range checks don't always use unsigned compare

2016-02-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615 --- Comment #3 from Peter Cordes --- @Richard and Jakub: That's just addressing the first part of my report, the problem with x <= (INT_MAX-1), right? You may have missed the second part of the problem, since I probably buried it under too muc

[Bug target/69622] New: compiler reordering of non-temporal (write-combining) stores produces significant performance hit

2016-02-02 Thread peter at cordes dot ca
Keywords: missed-optimization Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu, x86_64-linux-gnu IDK whether

[Bug tree-optimization/68557] Missed x86 peephole optimization for multiplying by a bool

2016-02-03 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #2

[Bug middle-end/51837] Use of result from 64*64->128 bit multiply via __uint128_t not optimized

2016-02-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1
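
As an illustration of the kind of source the title of bug 51837 refers to, this sketch takes the high half of a 64x64->128-bit product through GCC's `unsigned __int128` extension. The function name `mulhi64` is hypothetical, and the example assumes a 64-bit target where `__int128` is available. It is not the test case from the report.

```c
#include <stdint.h>

/* High 64 bits of the full 128-bit product a * b (hypothetical helper).
   Relies on the GCC/Clang unsigned __int128 extension. */
uint64_t mulhi64(uint64_t a, uint64_t b)
{
    unsigned __int128 product = (unsigned __int128)a * b;
    return (uint64_t)(product >> 64);
}
```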

[Bug c++/67461] Multiple atomic stores generate a StoreLoad barrier between each one, not just at the end

2016-02-04 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461 --- Comment #2 from Peter Cordes --- (In reply to Andrew Pinski from comment #1) > Hmm, I think there needs to be a barrier between each store as each store > needs to be observed by the other threads. On x86, stores are already ordered wrt. oth
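
To make the subject of bug 67461 concrete, here is a minimal C11 sketch of two back-to-back sequentially consistent atomic stores, the pattern the bug title describes. The variable and function names are hypothetical, and the C11 `<stdatomic.h>` interface is used in place of the C++ `std::atomic` from the report. The sketch only shows the source shape; it makes no claim about which barriers a particular compiler emits.

```c
#include <stdatomic.h>

/* Hypothetical shared flags, for illustration only. */
atomic_int flag_a;
atomic_int flag_b;

/* Two seq_cst stores in a row: the bug title is about whether a full
   barrier is needed after each one or only after the last. */
void publish_both(void)
{
    atomic_store(&flag_a, 1);  /* memory_order_seq_cst by default */
    atomic_store(&flag_b, 1);
}
```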

[Bug tree-optimization/69908] New: recognizing idioms that check for a buffer of all-zeros could make *much* better code

2016-02-22 Thread peter at cordes dot ca
Keywords: missed-optimization, ssemmx Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Checking a block of memory to see if it's all-zero,
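
Two common source-level idioms for the all-zero check mentioned in bug 69908 are sketched below, assuming they are representative of what the report has in mind (the report text is cut off here, so that is a guess). Both function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Idiom 1: plain byte loop with an early exit. */
bool is_all_zero_loop(const unsigned char *p, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (p[i] != 0)
            return false;
    }
    return true;
}

/* Idiom 2: first byte is zero and every byte equals its successor,
   expressed as an overlapping memcmp of the buffer with itself. */
bool is_all_zero_memcmp(const unsigned char *p, size_t n)
{
    if (n == 0)
        return true;
    return p[0] == 0 && memcmp(p, p + 1, n - 1) == 0;
}
```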

[Bug rtl-optimization/69933] New: non-ideal branch layout for an early-out return

2016-02-23 Thread peter at cordes dot ca
Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- (just guessing about this being an RTL bug, please reassign if it's target-specific or something else). This simple l

[Bug tree-optimization/69935] New: load not hoisted out of linked-list traversal loop

2016-02-23 Thread peter at cordes dot ca
Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- (please check the component. I guessed tree-optimization since it's cross-architecture.) gcc doesn't hois

[Bug rtl-optimization/69943] New: expressions with multiple associative operators don't always create instruction-level parallelism

2016-02-24 Thread peter at cordes dot ca
IRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- separate problems (which maybe should be separate

[Bug tree-optimization/69943] expressions with multiple associative operators don't always create instruction-level parallelism

2016-02-24 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943 --- Comment #3 from Peter Cordes --- (In reply to ktkachov from comment #2) > On second thought, reassociating signed addition is not legal in general > because we might introduce signed overflow where one didn't exist before. In an intermediat
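
The overflow concern quoted above can be illustrated at the source level. One way to reassociate a sum without introducing signed-overflow undefined behaviour is to do the additions in unsigned arithmetic, which wraps, and convert back at the end. The function name `sum4_reassoc` is hypothetical, and the final conversion to `int` is implementation-defined in C (GCC documents it as modulo 2^N). This is a sketch of the idea, not of what GCC does internally.

```c
/* (a + b) + (c + d) computed as two independent pairs, so the two
   partial sums can proceed in parallel. Unsigned addition wraps, so an
   intermediate overflow is harmless as long as the final mathematical
   result fits in int. */
int sum4_reassoc(int a, int b, int c, int d)
{
    unsigned lo = (unsigned)a + (unsigned)b;
    unsigned hi = (unsigned)c + (unsigned)d;
    return (int)(lo + hi);  /* implementation-defined if out of range */
}
```

For example, with a 32-bit `int`, the inputs `INT_MAX, 1, -1, -1` overflow in the first partial sum but still produce the correct total.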

[Bug target/69986] New: smaller code possible with -Os by using push/pop to spill/reload

2016-02-27 Thread peter at cordes dot ca
-optimization Severity: minor Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86-64-*-* #include <unistd.h> int f(int a) { close(a); return a; } push rbx mov

[Bug rtl-optimization/70408] New: reusing the same call-preserved register would give smaller code in some cases

2016-03-25 Thread peter at cordes dot ca
: missed-optimization Severity: enhancement Priority: P3 Component: rtl-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- int foo(int); // not inlineable int bar(int a) { return foo(a+2) + 5

[Bug c/70408] reusing the same call-preserved register would give smaller code in some cases

2016-03-25 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408 --- Comment #2 from Peter Cordes --- Should I open a separate bug for the reusing call-preserved regs thing, and retitle this one to the call-reordering issue we ended up talking about here? I always have a hard time limiting an optimization bug

[Bug c++/71245] New: std::atomic load/store bounces the data to the stack using fild/fistp

2016-05-23 Thread peter at cordes dot ca
-optimization Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu Same result with gcc 4.8, gcc5, and gcc6.1. Didn't

[Bug target/71321] New: [6 regression] x86: worse code for uint8_t % 10 and / 10

2016-05-27 Thread peter at cordes dot ca
Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: i386-linux-gnu, x86_64-linux-gnu If we have an integer (0..99), we can modulo and

[Bug target/71245] std::atomic load/store bounces the data to the stack using fild/fistp

2016-05-27 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245 --- Comment #3 from Peter Cordes --- (In reply to Uroš Bizjak from comment #2) > Recently x86 linux changed the barrier to what you propose. If it is worth, > we can change it without any problems. I guess it costs a code byte for a disp8 in the

[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7

2016-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #6

[Bug rtl-optimization/59511] [4.9 Regression] FAIL: gcc.target/i386/pr36222-1.c scan-assembler-not movdqa with -mtune=corei7

2016-06-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511 --- Comment #7 from Peter Cordes --- I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1. IDK if the cause is the same. (code from an improvement to the horizontal_add functions in Agner Fog's vector class library) #

[Bug target/80837] [7/8 regression] x86 accessing a member of a 16-byte atomic object generates terrible code: splitting/merging the bytes

2017-08-20 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837 --- Comment #3 from Peter Cordes --- Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact. -m64 just calls __atomic_load_16 gcc7.2 -O3 -mx32 output: follow_nounion(

[Bug inline-asm/82001] New: [5/6/7/8 regression] wrong code when two functions differ only in inline asm register constraints

2017-08-27 Thread peter at cordes dot ca
Keywords: wrong-code Severity: normal Priority: P3 Component: inline-asm Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- Target: x86_64-*-*, i?86-*-* When a single compilation unit

[Bug target/53687] _mm_cmpistri generates redundant movslq %ecx,%rcx on x86-64

2017-09-02 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #1

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 Peter Cordes changed: What|Removed |Added CC||peter at cordes dot ca --- Comment #4

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #7 from Peter Cordes --- C++11 std::atomic<> is correct, and the change was necessary. 8B alignment is required for 8B objects to be efficiently lock-free (using SSE load / store for .load() and .store(), see https://stackoverflow.co

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #6 from Peter Cordes --- My test-case on godbolt: https://godbolt.org/g/MmLycw. gcc8 snapshot still only has 4B alignment Fun fact: clang4.0 -m32 inlines lock cmpxchg8b for 8-byte atomic load/store. This is ironic, because it *does

[Bug libstdc++/71660] [5/6/7/8 regression] alignment of std::atomic<8 byte primitive type> (long long, double) is wrong on x86

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660 --- Comment #11 from Peter Cordes --- (In reply to Thiago Macieira from comment #10) > Actually, PR 65146 points out that the problem is not efficiency but > correctness. An under-aligned type could cross a cacheline boundary and thus > fail to b

[Bug target/65146] alignment of _Atomic structure member is not correct

2017-09-05 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146 --- Comment #8 from Peter Cordes --- BTW, all of my proposals are really ABI changes, even if struct layout stays the same. All code has to agree on which objects are lock-free or not, and whether they need to check alignment before using an SSE

[Bug tree-optimization/82135] New: Missed constant propagation through possible unsigned wraparound, with std::align() variable pointer, constant everything else.

2017-09-07 Thread peter at cordes dot ca
Version: 8.0 Status: UNCONFIRMED Keywords: missed-optimization Severity: normal Priority: P3 Component: tree-optimization Assignee: unassigned at gcc dot gnu.org Reporter: peter at cordes dot ca Target Milestone: --- The

[Bug target/80568] x86 -mavx256-split-unaligned-load (and store) is affecting AVX2 code, but probably shouldn't be.

2017-09-07 Thread peter at cordes dot ca
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568 Peter Cordes changed: What|Removed |Added Status|UNCONFIRMED |RESOLVED Resolution|---
