--- Comment #8 from peter at cordes dot ca 2009-02-12 17:56 ---
Would it cause any problems for g++ to behave more like a C compiler when it
comes to NULL? e.g. I found this bug report after finding that kscope 1.9.1
didn't compile, because it expected NULL to match the void* ve
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=39942
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #53
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243
--- Comment #1 from Peter Cordes ---
Forgot to mention, this probably applies to other ISAs with GP-integer
byte-reverse instructions and efficient unaligned loads.
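The pattern in question can be written portably; a minimal sketch (the function name `load_be32` is mine, not from the bug report):

```c
#include <stdint.h>
#include <string.h>

/* Portable big-endian 32-bit load: on targets with GP-integer byte-reverse
 * and efficient unaligned loads, GCC can compile this to a single
 * mov + bswap (or movbe). */
static uint32_t load_be32(const unsigned char *p) {
    uint32_t v;
    memcpy(&v, p, 4);              /* unaligned load, folded into one mov */
    return __builtin_bswap32(v);   /* byte-reverse in a GP register */
}
```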
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
We get a redundant instruction inside the vectorized loop here. But it's
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
--- Comment #1 from Peter Cordes ---
On AArch64 (with gcc8.2), we see a similar effect: more instructions in the
loop, and an indexed addressing mode.
https://godbolt.org/z/6ZVWY_
# strrev_explicit -O3 -mcpu=cortex-a53
...
.L4:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
Peter Cordes changed:
What|Removed |Added
Summary|extra sub inside vectorized |vectorized loop updating 2
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
typedef short
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92246
--- Comment #1 from Peter Cordes ---
And BTW, GCC *does* use vpermd (not vpermt2d) for swapt = int or long. This
problem only applies to char and short. Possibly because AVX2 includes vpermd
ymm.
Apparently CannonLake has 1 uop vpermb bu
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92244
--- Comment #4 from Peter Cordes ---
(In reply to Andrew Pinski from comment #3)
> (In reply to Peter Cordes from comment #1)
> > On AArch64 (with gcc8.2), we see a similar effect, more instructions in the
> > loop. And an indexed addressing mod
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
Peter Cordes changed:
What|Removed |Added
See Also||https://gcc.gnu.org/bugzill
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89346
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40838
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #91
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=93141
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89063
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
float cvt(double unused, double xmm1) { return xmm1; }
g++ (GCC-Explorer-Build)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80586
Peter Cordes changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #2 from Peter Cordes ---
(In reply to H.J. Lu from comment #1)
> But
>
> vxorps %xmm0, %xmm0, %xmm0
> vcvtsd2ss %xmm1, %xmm0, %xmm0
>
> are faster than both.
On Skylake-client (i7-6700k), I can't reproduce this r
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #3 from Peter Cordes ---
(In reply to H.J. Lu from comment #1)
I have a patch for PR 87007:
>
> https://gcc.gnu.org/ml/gcc-patches/2019-01/msg00298.html
>
> which inserts a vxorps at the last possible position. vxorps
> will be exe
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #5 from Peter Cordes ---
(In reply to H.J. Lu from comment #4)
> (In reply to Peter Cordes from comment #2)
> > Can you show some
> > asm where this performs better?
>
> Please try cvtsd2ss branch at:
>
> https://github.com/hjl-to
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #6 from Peter Cordes ---
(In reply to Peter Cordes from comment #5)
> But whatever the effect is, it's totally unrelated to what you were *trying*
> to test. :/
After adding a `ret` to each AVX function, all 5 are basically the same
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #8 from Peter Cordes ---
Created attachment 45544
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=45544&action=edit
testloop-cvtss2sd.asm
(In reply to H.J. Lu from comment #7)
> I fixed assembly codes and run it on different AVX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #10 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #9)
> There was similar patch for sqrt [1], I think that the approach is
> straightforward, and could be applied to other reg->reg scalar insns as
> well, independently o
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #4 from Peter Cordes ---
I suspect dep-chains are the problem, and branching to skip work is a Good
Thing when it's predictable.
(In reply to Richard Biener from comment #2)
> On Skylake it's better (1uop, 1 cycle latency) while on R
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #5 from Peter Cordes ---
IF ( xij.GT.+HALf ) xij = xij - PBCx
IF ( xij.LT.-HALf ) xij = xij + PBCx
For code like this, *if we can prove only one of the IF() conditions will be
true*, we can implement it
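A C sketch of that transformation, assuming at most one of the two conditions can hold (e.g. |xij| < HALf + PBCx); names are mine, not from the testcase:

```c
/* Branchless periodic-boundary wrap: when only one IF can fire, both
 * compares can feed a single blended add/sub instead of two branches. */
static double pbc_wrap(double xij, double half, double pbcx) {
    double adj = 0.0;
    if (xij >  half) adj = -pbcx;   /* compilers can emit cmp + blend */
    if (xij < -half) adj = +pbcx;   /* rather than branching twice */
    return xij + adj;
}
```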
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88494
--- Comment #6 from Peter Cordes ---
Oops, these were SD not SS. Getting sleepy >.<. Still, my optimization
suggestion for doing both compares in one masked SUB of +-PBCx applies equally.
And I think my testing with VBLENDVPS should apply equa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #15 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #13)
> I assume that memory inputs are not problematic for SSE/AVX {R,}SQRT, RCP
> and ROUND instructions. Contrary to CVTSI2S{S,D}, CVTSS2SD and CVTSD2SS, we
> currently
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
From
https
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69560
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81274
--- Comment #2 from Peter Cordes ---
The stray LEA bug seems to be fixed in current trunk (9.0.0 20180429), at least
for this testcase. Gcc's stack-alignment strategy seems to be improved overall
(not copying the return address when not needed),
Reporter: peter at cordes dot ca
Target Milestone: ---
Bug 84011 shows some really silly code-gen for PIC code and discussion
suggested using a table of offsets instead of a table of actual pointers, so
you just need one base address.
A further optimization is possible when the strings are
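The offset-table idea might look like this in C (table contents and names are made up for illustration):

```c
/* One string blob plus small offsets: PIC code needs only one base address,
 * instead of a table of absolute pointers that all need load-time relocation. */
static const char strtab[] = "eth\0mii\0fddi";
static const unsigned char offs[] = { 0, 4, 8 };

static const char *if_name(unsigned i) {
    return strtab + offs[i];
}
```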
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85585
--- Comment #1 from Peter Cordes ---
By comparison, the no-PIE table of pointers only needs one instruction:
movq    CSWTCH.4(,%rdi,8), %rax
So all my suggestions cost 1 extra instruction on x86 in no-PIE mode, but at a
massive savings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011
--- Comment #12 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #10)
> (In reply to Peter Cordes from comment #9)
> > gcc already totally misses optimizations here where one string is a suffix
> > of another. "mii" could just be a
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84011
--- Comment #13 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #10)
> ?? That is the task for the linker SHF_MERGE|SHF_STRINGS handling.
> Why should gcc duplicate that?
Because gcc would benefit from knowing if merging makes the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615
--- Comment #5 from Peter Cordes ---
Update: https://godbolt.org/g/ZQDY1G
gcc7/8 optimizes this to and / cmp / jb, while gcc6.3 doesn't.
void rangecheck_var(int64_t x, int64_t lim2) {
//lim2 >>= 60;
lim2 &= 0xf; // let the compiler figure
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #14 from Peter Cordes ---
I happened to look at this old bug again recently.
re: extracting the low two 32-bit elements:
(In reply to Uroš Bizjak from comment #11)
> > Or without SSE4 -mtune=sandybridge (anything that excluded
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
--- Comment #5 from Peter Cordes ---
AVX512F with merge-masking for integer->vector broadcasts gives us a single-uop
replacement for vpinsrq/d, which is 2 uops on Intel/AMD.
See my answer on
https://stackoverflow.com/questions/50779309/loading-an
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
The wrong-code bug 86314 also
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91398
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91515
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #4
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
As a workaround for PR 82887 some code (e.g. a memset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82887
--- Comment #5 from Peter Cordes ---
Reported bug 92080 for the missed CSE
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
--- Comment #5 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #4)
> Can't reproduce. It is true that we now emit the __atomic_load_16 call, but
> that was intentional change
Yup.
>, and it can't be easily tail call, because the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
--- Comment #6 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #4)
> But have just tried gcc 7.1.0 release and can't reproduce even there.
Matt says the Compiler Explorer backend uses upstream release tarballs like
`URL=ftp://ftp.g
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
Bug 53947 depends on bug 80846, which changed state.
Bug 80846 Summary: auto-vectorized AVX2 horizontal sum should narrow to 128b
right away, to be more efficient for Ryzen and Intel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
Peter Cordes changed:
What|Removed |Added
Status|RESOLVED|REOPENED
Resolution|FIXED
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #22 from Peter Cordes ---
Forgot the Godbolt link with updated cmdline options:
https://godbolt.org/g/FCZAEj.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #25 from Peter Cordes ---
We're getting a spill/reload inside the loop with AVX512:
.L2:
vmovdqa64 (%esp), %zmm3
vpaddd (%eax), %zmm3, %zmm2
addl    $64, %eax
vmovdqa64 %zmm2, (%esp)
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #28 from Peter Cordes ---
(In reply to Richard Biener from comment #27)
> Note that this is deliberately left as-is because the target advertises
> (cheap) support for horizontal reduction. The vectorizer simply generates
> a single
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #3
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=38959
--- Comment #4 from Peter Cordes ---
The __builtin_ia32_rdpmc being a pure function bug I mentioned in my previous
comment is already reported and fixed (in gcc9 only): bug 87550
It was present since at least gcc 5.0
https://software.intel.com/e
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571
--- Comment #2 from Peter Cordes ---
I think hjl's patch for PR 89071 / PR 87007 fixes (most of?) this, at least for
AVX.
If register pressure is an issue, using a reg holding an arbitrary constant
(instead of xor-zeroed) is a valid option, as th
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89071
--- Comment #22 from Peter Cordes ---
Nice, that's exactly the kind of thing I suggested in bug 80571. If this
covers
* vsqrtss/sd (mem), %merge_into, %xmm
* vpcmpeqd   %same, %same, %dest    # false dep on KNL / Silvermont
* vcmptrueps %sam
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88809
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #4
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
cmp/jne is always at least as efficient as xor
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568
--- Comment #1 from Peter Cordes ---
https://godbolt.org/z/hHCVTc
Forgot to mention, stack-protector also disables use of the red-zone for no
apparent reason, so that's another missed optimization. (Perhaps rarely
relevant; probably most functi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568
--- Comment #3 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #2)
> The xor there is intentional, for security reasons we do not want the stack
> canary to stay in the register afterwards, because then it could be later
> spilled o
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90568
--- Comment #5 from Peter Cordes ---
And BTW, this only helps if the SUB and JNE are consecutive, which GCC
(correctly) doesn't currently optimize for with XOR.
If this sub/jne is different from a normal sub/branch and won't already get
optimize
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
void protect_me() {
volatile int buf[2];
buf[1] = 3;
}
https://godbolt.org/z/xdlr5w
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
GCC9.1 and current trunk
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #4 from Peter Cordes ---
We should not put any stock in what ICC does for GNU C native vector indexing.
I think it doesn't know how to optimize that because it *always* spills/reloads
even for `vec[0]` which could be a no-op. And it
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #3 from Peter Cordes ---
I had another look at this with current trunk. Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512. (If we improve code-gen
for that choice, it will make it a win in more cases
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #4 from Peter Cordes ---
The VPAND instructions in the 256-bit version are a missed-optimization.
I had another look at this with current trunk. Code-gen is similar to before
with -march=skylake-avx512 -mprefer-vector-width=512. (I
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660
--- Comment #17 from Peter Cordes ---
(In reply to Jonathan Wakely from comment #16)
> But what we do care about is comment 2, i.e. _Atomic(T) and std::atomic
> should have the same alignment (both in an out of structs). Maybe that needs
> the C
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Bug 82267 was fixed for RSP only. (Or interpreted narrowly as only being
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85038
--- Comment #1 from Peter Cordes ---
Correction for AArch64: it supports addressing modes with a 64-bit base
register + 32-bit index register with zero or sign extension for the 32-bit
index. But not 32-bit base registers.
As a hack that's bett
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-*, x86_64-*
In x86, both jmp and jcc can use either a rel8 or
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
gcc sometimes misses the unsigned-compare trick for checking if a signed val
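The trick in question, as a standalone sketch (for non-negative n, two signed compares collapse into one unsigned compare):

```c
#include <stdint.h>

/* (x >= 0 && x < n) is equivalent to one unsigned compare when n >= 0,
 * because a negative x wraps to a huge unsigned value. */
static int in_range(int64_t x, int64_t n) {
    return (uint64_t)x < (uint64_t)n;
}
```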
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69615
--- Comment #3 from Peter Cordes ---
@Richard and Jakub:
That's just addressing the first part of my report, the problem with x <=
(INT_MAX-1), right?
You may have missed the second part of the problem, since I probably buried it
under too muc
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu
IDK whether
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68557
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=51837
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67461
--- Comment #2 from Peter Cordes ---
(In reply to Andrew Pinski from comment #1)
> Hmm, I think there needs to be a barrier between each store as each store
> needs to be observed by the other threads.
On x86, stores are already ordered wrt. oth
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Checking a block of memory to see if it's all-zero,
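A scalar version of such a check, written so the vectorizer can use SIMD loads with a single test at the end (a sketch, not code from the report):

```c
#include <stddef.h>

/* All-zero check over a block: OR all bytes together, test once.
 * The loop is trivially vectorizable with por/vpor plus one final compare. */
static int all_zero(const unsigned char *p, size_t n) {
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= p[i];
    return acc == 0;
}
```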
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
(just guessing about this being an RTL bug, please reassign if it's
target-specific or something else).
This simple l
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
(please check the component. I guessed tree-optimization since it's
cross-architecture.)
gcc doesn't hois
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
separate problems (which maybe should be separate
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=69943
--- Comment #3 from Peter Cordes ---
(In reply to ktkachov from comment #2)
> On second thought, reassociating signed addition is not legal in general
> because we might introduce signed overflow where one didn't exist before.
In an intermediat
Keywords: missed-optimization
Severity: minor
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86-64-*-*
#include <unistd.h>
int f(int a) { close(a); return a; }
push rbx
mov
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: rtl-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
int foo(int); // not inlineable
int bar(int a) {
return foo(a+2) + 5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70408
--- Comment #2 from Peter Cordes ---
Should I open a separate bug for the reusing call-preserved regs thing, and
retitle this one to the call-reordering issue we ended up talking about here?
I always have a hard time limiting an optimization bug
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: c++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu
Same result with gcc 4.8, gcc5, and gcc6.1. Didn't
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i386-linux-gnu, x86_64-linux-gnu
If we have an integer (0..99), we can modulo and
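For instance, div and mod by the same constant, which GCC folds into one multiplicative-inverse sequence (the helper name is mine):

```c
/* Split 0..99 into decimal digits: x/10 and x%10 share one multiply-based
 * division, so the quotient is computed once and the remainder derived. */
static void split_digits(unsigned x, unsigned *tens, unsigned *ones) {
    *tens = x / 10;
    *ones = x % 10;
}
```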
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71245
--- Comment #3 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #2)
> Recently x86 linux changed the barrier to what you propose. If it is worth,
> we can change it without any problems.
I guess it costs a code byte for a disp8 in the
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59511
--- Comment #7 from Peter Cordes ---
I'm seeing the same symptom, affecting gcc4.9 through 5.3. Not present in 6.1.
IDK if the cause is the same.
(code from an improvement to the horizontal_add functions in Agner Fog's vector
class library)
#
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
--- Comment #3 from Peter Cordes ---
Seems to be fixed in gcc7.2.0: https://godbolt.org/g/jRwtZN
gcc7.2 is fine with -m32, -mx32, and -m64, but x32 is the most compact. -m64
just calls __atomic_load_16
gcc7.2 -O3 -mx32 output:
follow_nounion(
Keywords: wrong-code
Severity: normal
Priority: P3
Component: inline-asm
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
When a single compilation unit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53687
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #4
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660
--- Comment #7 from Peter Cordes ---
C++11 std::atomic<> is correct, and the change was necessary.
8B alignment is required for 8B objects to be efficiently lock-free (using SSE
load / store for .load() and .store(), see
https://stackoverflow.co
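The C11 counterpart of that requirement can be checked directly (a sketch; on current x86 GCC the assertion holds):

```c
#include <stdatomic.h>

/* 8-byte _Atomic objects need natural alignment so a single 8-byte access
 * (e.g. an SSE load/store on 32-bit) can't split across a cache line. */
_Static_assert(_Alignof(_Atomic long long) >= 8,
               "8-byte atomics must be 8-byte aligned to be lock-free");
```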
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146
--- Comment #6 from Peter Cordes ---
My test-case on godbolt: https://godbolt.org/g/MmLycw. gcc8 snapshot still
only has 4B alignment
Fun fact: clang4.0 -m32 inlines lock cmpxchg8b for 8-byte atomic load/store.
This is ironic, because it *does
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660
--- Comment #11 from Peter Cordes ---
(In reply to Thiago Macieira from comment #10)
> Actually, PR 65146 points out that the problem is not efficiency but
> correctness. An under-aligned type could cross a cacheline boundary and thus
> fail to b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146
--- Comment #8 from Peter Cordes ---
BTW, all of my proposals are really ABI changes, even if struct layout stays
the same.
All code has to agree on which objects are lock-free or not, and whether they
need to check alignment before using an SSE
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
The
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568
Peter Cordes changed:
What|Removed |Added
Status|UNCONFIRMED |RESOLVED
Resolution|---