https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78762
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #16
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
static const int aligned = 0
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
(same code as bug 82136, but with aligned pointers, and discussing the overall
vectorization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
--- Comment #1 from Peter Cordes ---
Whoops, the compiler-explorer link had aligned=1. This one produces the asm I
showed in the original report: https://godbolt.org/g/WsZ5S9
See bug 82137 for a much more efficient vectorization strategy gcc sh
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
#include
// stripped down from a real
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
(not sure if tree-optimization is the right "component", pl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82135
--- Comment #2 from Peter Cordes ---
(In reply to Marc Glisse from comment #1)
> This PR is a bit messy, please minimize your examples...
Sorry, looking at it again later I could have done better. I thought it was
somewhat relevant that this wa
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67458
--- Comment #5 from Peter Cordes ---
> optabs: ensure atomic_load/stores have compiler barriers
Thanks for taking a look at this report. But I think it's not necessary to
have a full 2-way barrier. If there's a lighter-weight way to get the
be
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660
--- Comment #13 from Peter Cordes ---
(In reply to Thiago Macieira from comment #12)
> Another problem is that we've now had a couple of years with this issue, so
> it's probably worse to make a change again.
A change to C++11 std::atomic? Yeah
https://godbolt.org/g/GhW4b8
Status: UNCONFIRMED
Keywords: wrong-code
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
--- Comment #1 from Peter Cordes ---
Related: bug 55747 describes why gcc keeps the `push {r4, lr}` in the _Noreturn
function: backtraces.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82137
--- Comment #2 from Peter Cordes ---
(In reply to Richard Biener from comment #1)
> Interesting idea. It's probably a bit hard to make the vectorizer do this
> though given its current structure and the fact that it would have to
> cost the ext
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82136
--- Comment #3 from Peter Cordes ---
(In reply to Richard Biener from comment #2)
> And it gets worse because of the splitting
> which isn't exposed to the vectorizer.
Split loads/stores can be a useful shuffling strategy even on Haswell/Skylake
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
--- Comment #3 from Peter Cordes ---
(In reply to jos...@codesourcery.com from comment #2)
> Falling off a noreturn function sounds like it could be another case to
> insert __builtin_trap (), as we do in various cases of undefined behavior.
gc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
--- Comment #5 from Peter Cordes ---
(In reply to Ramana Radhakrishnan from comment #4)
> It's a "feature" - if the function really doesn't return, then there is no
> real requirement to save and restore all callee-saved registers.
>
> A deliber
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: arm*-*-*
int ext();
int tailcall_external() { return ext(); }
// https
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71725
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #1
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=47769
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #6
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
bool bt_signed(int x, unsigned bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259
--- Comment #1 from Peter Cordes ---
More generally, you can flip a higher bit while copying with
lea 64(%rdi), %eax
That leaves the bits above that position munged by carry-out, but that isn't
always a problem.
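As a minimal C sketch of the trick described here (the function name and the choice of bit 6 are illustrative, not from the report): adding 64 flips bit 6, and the carry-out only disturbs bits above it, so when the caller only consumes the low bits this can be done while copying with a single lea.

#include <stdint.h>

/* Hypothetical example: flip bit 6 while copying.  If only bits 6:0 of the
   result matter, a compiler could emit  lea 64(%rdi), %eax  because the
   carry-out of the +64 only munges higher bits. */
uint32_t copy_flip_bit6(uint32_t x) {
    return x ^ 64;
}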
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
int shift(int x, int c) {
return
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259
--- Comment #3 from Peter Cordes ---
Oops, BT sets CF, not ZF. So
bt $13, %edi
setnc %al    # aka setae
ret
This is what clang does for the bt_ functions, and might be optimal for many
use-cases. (For br
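For reference, a minimal sketch (naming and the fixed bit number are illustrative; the original bt_signed source is truncated above) of the kind of function that maps onto this sequence: return whether bit 13 is clear, so CF=0 after bt means a result of 1.

#include <stdbool.h>

/* Hypothetical example: with  bt $13, %edi  the tested bit lands in CF,
   and  setnc/setae %al  materializes the "bit is clear" result. */
bool bit13_clear(int x) {
    return !(x & (1 << 13));
}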
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
unsigned shld(unsigned a, unsigned b, unsigned n){
//n=13;
a <<= n;
b >
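A plausible completion of the truncated snippet (the exact body in the report is unknown): the classic double-shift idiom that can compile to x86 SHLD, assuming 0 < n < 32 so neither shift count is out of range.

unsigned shld(unsigned a, unsigned b, unsigned n) {
    // n = 13;
    a <<= n;           /* high part */
    b >>= 32 - n;      /* bits shifted in from the low part; needs 0 < n < 32 */
    return a | b;
}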
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82259
--- Comment #4 from Peter Cordes ---
(In reply to Uroš Bizjak from comment #2)
> A couple of *scc_bt patterns are missing. These are similar to already
> existing *jcc_bt patterns. Combine wants:
Does gcc also need patterns for bt + cmovcc?
Thi
Status: UNCONFIRMED
Keywords: ABI, missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*
x32 defaults to using 32-bit address
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260
--- Comment #4 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #2)
> From pure instruction size POV, for the first 2 alternatives as can be seen
> say on:
> ...
> movb $0x15, %al
> movl $0x15, %eax
> movb $-0x78, %bl
> movl $-0x78,
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82260
--- Comment #5 from Peter Cordes ---
> (not (match_test "TARGET_PARTIAL_REG_STALL"))
gcc is doing this even with -mtune=core2.
Core2 / Nehalem stall (the front-end) for 2-3 cycles to insert a merging uop
when reading a full register after w
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
Peter Cordes changed:
What | Removed | Added
Status | RESOLVED | VERIFIED
--- Comment #7 from Peter Cordes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
Peter Cordes changed:
What | Removed | Added
Status | VERIFIED | UNCONFIRMED
Resolution | WONTFIX
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
--- Comment #10 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #9)
> None of the above options is IMHO acceptable.
> This is UB like any other.
I still think it's a quality-of-implementation bug that could be fixed without
downsid
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #2
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=81602
--- Comment #3 from Peter Cordes ---
Forgot to mention: memory-source popcnt with an indexed addressing mode would
also be worse on SnB/IvB: it can't stay micro-fused, so the front-end
un-laminates it in the issue stage.
Haswell and later can ke
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
long long test_and(long long x) {
return x & 0x77fff
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
gcc never seems to emit BZHI on its own.
// exact BZHI behaviour for all inputs (with no C UB)
unsigned
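A sketch of what such a function can look like (a reconstruction, since the snippet above is truncated): BZHI keeps the low n bits of the source, reads only bits 7:0 of the index, and leaves the value unchanged when that index is >= the operand size.

/* Hypothetical reconstruction: exact 32-bit BZHI semantics with no C UB. */
unsigned bzhi32(unsigned x, unsigned n) {
    unsigned idx = n & 0xff;                  /* BZHI uses only the low 8 bits of n */
    return idx < 32 ? x & ((1u << idx) - 1)   /* clear bits [31:idx] */
                    : x;                      /* idx >= 32: keep everything */
}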
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
unsigned long long use_intrinsic(void
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82158
Peter Cordes changed:
What | Removed | Added
Resolution | INVALID | WONTFIX
--- Comment #11 from Peter Cordes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82267
--- Comment #6 from Peter Cordes ---
(In reply to H.J. Lu from comment #2)
> > Are there still cases where -maddress-mode=long makes worse code?
>
>
> Yes, there are more places where -maddress-mode=long needs to zero-extend
> address to 64 bit
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924
--- Comment #1 from Peter Cordes ---
There's __m128i _mm_loadl_epi64 (__m128i const*
mem_addr)(https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=movq&expand=5450,4247,3115&techs=SSE2),
which gcc makes available in 32-bit mode.
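Usage sketch of the intrinsic mentioned above (the function name here is illustrative): it performs a 64-bit movq load into the low qword of an xmm register, zeroing the upper half, and is usable even in 32-bit mode.

#include <emmintrin.h>
#include <stdint.h>

/* movq load: low 64 bits from memory, upper 64 bits of the xmm reg zeroed. */
__m128i load_low64(const int64_t *p) {
    return _mm_loadl_epi64((const __m128i *)p);
}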
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
--- Comment #4 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #0)
> At least on i7-5960X in the following testcase:
>
> baz is fastest as well as shortest.
> So I think we should consider using movl $cst, %edx; shlq $shift, %rdx
>
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82339
--- Comment #5 from Peter Cordes ---
(In reply to Richard Biener from comment #2)
> I always wondered if it is more efficient to have constant pools per function
> in .text so we can do %rip relative loads with short displacement?
There's no rel
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68924
--- Comment #3 from Peter Cordes ---
(In reply to Marc Glisse from comment #2)
> Does anything bad happen if you remove the #ifdef/#endif for
> _mm_cvtsi64_si128? (2 files in the testsuite would need updating for a
> proper patch)
It's just a wr
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Created attachment 41285
--> https://gcc.gnu.org/
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Tar
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Tar
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80571
--- Comment #1 from Peter Cordes ---
Tracking "cold" registers that are safe to use as a read-only source (whether
or not they're holding a useful value like a constant) has a couple other
applications for x86:
* vcvtsi2ss/sd %src,%merge_into,
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
float sqrt_depcheck(float a, float b
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568
--- Comment #2 from Peter Cordes ---
Using ISA-extension options removes some microarchitectures from the set of
CPUs that can run the code, so it would be appropriate for them to have some
effect on tuning.
A "generic AVX2 CPU" is much more spe
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
Currently, gcc compiles _mm256_setzero_ps() to vxorps %ymm0, %ymm0, %ymm0, or
zmm for _mm512_setzero_ps. And similar for pd and integer vectors, using a
vector
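For context, a sketch of the pattern being discussed (not the report's own testcase): a zeroed YMM vector. A 128-bit VEX vxorps zeroes the full ymm/zmm register as well, via implicit zero-extension, so the wider form isn't needed.

#include <immintrin.h>

/* Currently emitted as  vxorps %ymm0, %ymm0, %ymm0 ; a 128-bit
   vxorps %xmm0, %xmm0, %xmm0  would leave ymm0/zmm0 all-zero too. */
__m256 zero_vec(void) {
    return _mm256_setzero_ps();
}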
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
--- Comment #2 from Peter Cordes ---
> The same possibly applies to all "zero-extending" moves?
Yes, if a vmovdqa %xmm0,%xmm1 will work, it's the best choice on AMD CPUs,
and doesn't hurt on Intel CPUs. So in any case where you need to copy a
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
This actually applies to all cases of testing
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
__m128i combine64(long long
Product: gcc
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
T
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80819
--- Comment #2 from Peter Cordes ---
(In reply to Andrew Pinski from comment #1)
> >-mtune=generic still stores/reloads instead of using movd for %edi and %edx,
> >which is worse for most CPUs.
> Worse on most Intel but not most AMD CPUs. You
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: i?86-*-*
This affects 64-bit atomic loads/stores, as well as _mm_set_epi
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #1 from Peter Cordes ---
See https://godbolt.org/g/krXH9M for the functions I was looking at.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
--- Comment #2 from Peter Cordes ---
See also https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833.
gcc -m32 does an even worse job of getting int64_t into an xmm reg, e.g. as
part of a 64-bit atomic store.
We get a store-forwarding failure from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80820
--- Comment #3 from Peter Cordes ---
Also, going the other direction is not symmetric. On some CPUs, a store/reload
strategy for xmm->int might be better even if an ALU strategy for int->xmm is
best.
Also, the choice can depend on chunk size, s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #2 from Peter Cordes ---
On most CPUs, psrldq / movd is optimal for xmm[1] -> int without SSE4. On
SnB-family, movd runs on port0, and psrldq can run on port5, so they can
execute in parallel. (And the second movd can run the next c
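In intrinsics terms, the SSE2-only sequence described here looks like the following (function name is illustrative): shift element 1 down to element 0 with psrldq, then movd it to an integer register.

#include <emmintrin.h>
#include <stdint.h>

/* xmm[1] -> int without SSE4: psrldq by 4 bytes, then movd. */
int32_t extract_lane1(__m128i v) {
    __m128i shifted = _mm_srli_si128(v, 4);   /* psrldq: byte shift right by 4 */
    return _mm_cvtsi128_si32(shifted);        /* movd xmm -> r32 */
}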
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #3 from Peter Cordes ---
Atom's movd xmm->int is slower (lat=4, rtput=2) than its movd int->xmm (lat=3,
rtput=1), which is opposite of every other CPU (except Silvermont where they're
the same throughput but xmm->int is 1c slower). S
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #4 from Peter Cordes ---
I don't think it's worth anyone's time to implement this in 2017, but using MMX
regs for 64-bit store/load would be faster on really old CPUs that split 128b
vector insns into two halves, like K8 and Pentium
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
For std::atomic or similar small struct, accessing just
one member with foo.load().m
Version: 7.1.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
--- Comment #3 from Peter Cordes ---
The point about moves also applies to integer code, since a 64-bit mov requires
an extra byte for the REX prefix (unless a REX prefix was already required for
r8-r15).
I just noticed a case where gcc uses a 6
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=70490
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #5
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
Peter Cordes changed:
What | Removed | Added
See Also | | https://gcc.gnu.org/bugzill
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71660
Peter Cordes changed:
What | Removed | Added
CC | | peter at cordes dot ca
--- Comment #5
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
float sumfloat_omp(const float arr[]) {
float sum=0;
#pragma omp
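A hypothetical completion of the truncated test case (the actual pragma clauses and trip count in the report are unknown): an OpenMP SIMD reduction over a float array.

float sumfloat_omp(const float arr[]) {
    float sum = 0;
    #pragma omp simd reduction(+:sum)   /* guessed clauses; the original is truncated */
    for (int i = 0; i < 1024; i++)
        sum += arr[i];
    return sum;
}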
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
--- Comment #3 from Peter Cordes ---
(In reply to Jonathan Wakely from comment #2)
> You've reported this against libstdc++
I had to take a guess at the right component, based on a couple other
std::atomic bugs I looked at. Apparently I picked
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
--- Comment #4 from Peter Cordes ---
Thanks for correcting my mistake in tagging this bug, but this got me thinking
it's not just a C++ issue.
This also applies to GNU C __atomic_load_n(), and ISO C11 stdatomic code like
#include
#include
uin
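A minimal C11 sketch of the pattern (the testcase above is truncated; the struct layout and names here are illustrative): loading a whole atomic struct just to read one member, where ideally only that member would be loaded.

#include <stdatomic.h>
#include <stdint.h>

struct pair { uint32_t lo, hi; };

uint32_t load_lo(const _Atomic struct pair *p) {
    struct pair tmp = atomic_load_explicit(p, memory_order_relaxed);
    return tmp.lo;    /* the whole 64-bit object is loaded; only .lo is used */
}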
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80833
--- Comment #6 from Peter Cordes ---
(In reply to Richard Biener from comment #5)
> There's some related bugs. I think there is no part of the compiler that
> specifically tries to avoid store forwarding issues.
Ideally the compiler would keep
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80835
--- Comment #5 from Peter Cordes ---
Previous godbolt link was supposed to be: https://godbolt.org/g/78kIAl
which includes the CAS functions.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80837
--- Comment #2 from Peter Cordes ---
(In reply to Richard Biener from comment #1)
> GCC 8 generates a __atomic_load_16 call for me while GCC 6 does
>
> lock cmpxchg16b (%rdi)
That's expected. See https://gcc.gnu.org/ml/gcc-patches/2017
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80844
--- Comment #3 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #2)
> It doesn't always zero, it can be pretty arbitrary.
Is it feasible to have it just load the first vector of elements, instead of
broadcasting the identity value? i.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80846
--- Comment #2 from Peter Cordes ---
(In reply to Richard Biener from comment #1)
> That is, it was supposed to end up using pslldq
I think you mean PSRLDQ. Byte zero is the right-most when drawn in a way that
makes bit/byte shift directions al
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*
-mtune=generic and -mtune=intel currently don't opt
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
Full version:
http://stackoverflow.com/questions/41323911/why-the-difference-in-code
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
void pack_high8_basel
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
gcc defeats this attempt to get it to reduce the
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
#include
#include
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #2 from Peter Cordes ---
(In reply to Jakub Jelinek from comment #1)
> Created attachment 42296 [details]
> gcc8-pr82370.patch
>
> If VPAND is exactly as fast as VPANDQ except for different encodings, then
> maybe we can do something
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #3 from Peter Cordes ---
Doesn't change the performance implications, but I just realized I have the
offset-load backwards. Instead of
vpsrlw $8, (%rsi), %xmm1
vpand 15(%rsi), %xmm2, %xmm0
this algorithm should us
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #4 from Peter Cordes ---
VPANDQ can be shorter than an equivalent VPAND for displacements > 127 but <=
16 * 127 or 32 * 127 that are an exact multiple of the vector width. EVEX
with disp8 always implies a compressed displacemen
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
static __attribute((noinline))
int get_constant() { /* optionally stuff
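A hypothetical reconstruction of the truncated snippet (the actual constant and any extra body are unknown): a noinline function with a known constant return value that the caller could in principle exploit.

static __attribute__((noinline))
int get_constant(void) {
    /* optionally more code here */
    return 42;    /* placeholder constant */
}

int caller(void) {
    return get_constant() + 1;   /* ideally the known return value could propagate */
}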
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82432
--- Comment #1 from Peter Cordes ---
Meant to add https://godbolt.org/g/K9CxQ6 before submitting. And to say I
wasn't sure tree-optimization was the right component.
I did check that -flto didn't do this optimization either.
Is it worth openin
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82370
--- Comment #5 from Peter Cordes ---
I got off topic with this bug. It was supposed to be about emitting
vpsrlw $8, (%rsi), %xmm1    # load folded into AVX512BW version
instead of
vmovdqu64 (%rsi), %xmm0 # or VEX vmovdqu;
Version: 8.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
gcc bottlenecks on
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82459
--- Comment #1 from Peter Cordes ---
BTW, if we *are* using vpmovwb, it supports a memory operand. It doesn't save
any front-end uops on Skylake-avx512, just code-size. Unless it means less
efficient packing in the uop cache (since all uops fro
Version: 8.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
long long sumarray(const
Keywords: missed-optimization, ssemmx
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
long long sumarray(const int *data)
{
data = (const int*)__builtin_assume_alig
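A hypothetical completion of the truncated snippet (alignment and trip count are guesses): summing ints into a long long after promising the compiler an aligned pointer.

long long sumarray(const int *data)
{
    data = (const int *)__builtin_assume_aligned(data, 64);
    long long sum = 0;
    for (int i = 0; i < 1024; i++)
        sum += data[i];
    return sum;
}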
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*
struct twoint {
int a, b;
};
int bar(struct
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82680
--- Comment #2 from Peter Cordes ---
gcc's sequence is *probably* good, as long as it uses xor / comisd / setcc and
not comisd / setcc / movzx (which gcc often likes to do for integer setcc).
(u)comisd and cmpeqsd both run on the FP add unit. A
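A sketch of the kind of function under discussion (naming is illustrative): a double comparison returning bool, where the preferred code zeroes the result register with xor before the comisd/setcc instead of appending a movzx.

#include <stdbool.h>

/* Preferred shape:  xorl %eax,%eax ; comisd %xmm1,%xmm0 ; seta %al */
bool dgreater(double a, double b) {
    return a > b;
}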
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include
#include
#include
void p128_as_u8hex(__m128i in) {
_Alignas(16
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
#include "immintrin.h"
#include "inttypes.h"
__m256i gather(char *array, uint16_t *offset) {
return _mm256_set_epi8(array[offset[0]], array[offset[1]], array[offset[2]],
array[offset[3]], arr
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
#include
#include
int *foo(unsigned size)
{
int *p
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82729
--- Comment #2 from Peter Cordes ---
(In reply to Richard Biener from comment #1)
> The issue is we have no merging of stores at the RTL level and the GIMPLE
> level doesn't know whether the variables will end up allocated next to each
> other.