https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110619
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #7 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108441
--- Comment #4 from Peter Cordes ---
This is already fixed in current trunk; sorry, I forgot to check that before
recommending that this store-coalescing bug be reported.
# https://godbolt.org/z/j3MdWrcWM
# GCC nightly -O3 (tune=generic) and GCC11
s
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
--- Comment #27 from Peter Cordes ---
(In reply to Alexander Monakov from comment #26)
> Sure, the right course of action seems to be to simply document that atomic
> types and built-ins are meant to be used on "common" (writeback) memory
Agree
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
--- Comment #25 from Peter Cordes ---
(In reply to Alexander Monakov from comment #24)
>
> I think it's possible to get UC/WC mappings via a graphics/compute API (e.g.
> OpenGL, Vulkan, OpenCL, CUDA) on any OS if you get a mapping to device
> m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104688
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #23 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106138
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #3 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105929
Bug ID: 105929
Summary: [AArch64] armv8.4-a allows atomic stp. 64-bit
constants can use 2 32-bit halves with _Atomic or
volatile
Product: gcc
Version: 13.0
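A rough sketch of the kind of code that summary describes (assumed shape, not
the PR's own testcase): a relaxed 64-bit atomic store of a constant, which an
armv8.4-a target could emit as an stp of two 32-bit halves.
#include <stdatomic.h>
#include <stdint.h>
/* Assumed example: with armv8.4-a's atomic LDP/STP guarantee, this store of
   a 64-bit constant could be done as stp of two w registers instead of
   materializing the full 64-bit immediate in an x register. */
void store_const(_Atomic uint64_t *p)
{
    atomic_store_explicit(p, 0x1122334455667788u, memory_order_relaxed);
}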
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105928
Bug ID: 105928
Summary: [AArch64] 64-bit constants with same high/low halves
can use ADD lsl 32 (-Os at least)
Product: gcc
Version: 13.0
Status: UNCONFIRMED
K
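A hedged sketch of what that summary describes (the constant is an invented
example, not taken from the PR): a 64-bit value whose 32-bit halves are equal
could be built as mov+movk for the low half plus add x0, x0, x0, lsl #32.
#include <stdint.h>
/* Invented constant: both halves are 0x12345678, so the high half can be
   synthesized from the low half with one add-with-shifted-operand. */
uint64_t repeated_halves(void)
{
    return 0x1234567812345678u;
}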
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105904
Bug ID: 105904
Summary: Predicated mov r0, #1 with opposite conditions could
be hoisted, between 1 and 1<
// using the libstdc++ header
#include <bit>
unsigned roundup(unsigned x){
    return std::bit_ceil(x);
}
http
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596
--- Comment #1 from Peter Cordes ---
https://godbolt.org/z/aoG55T5Yq
gcc -O3 -m32 has the same problem with unsigned long long total and unsigned i.
Pretty much identical instruction sequences in the loop for all 3 versions,
doing add/adc to
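For reference, a minimal sketch matching the variants described above (assumed
form; the PR's exact testcase isn't quoted in this excerpt): a narrow unsigned
counter accumulated into a wider total, where only the accumulation needs the
wide add/adc, not the counter itself.
/* Assumed shape of the -m32 variant: unsigned long long total, unsigned i.
   The loop counter should stay 32-bit instead of being widened. */
unsigned long long sum_to(unsigned n)
{
    unsigned long long total = 0;
    for (unsigned i = 0; i < n; i++)
        total += i;
    return total;
}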
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105596
Bug ID: 105596
Summary: Loop counter widened to 128-bit unnecessarily
Product: gcc
Version: 13.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65146
--- Comment #25 from Peter Cordes ---
(In reply to CVS Commits from comment #24)
> The master branch has been updated by Jakub Jelinek :
>
> https://gcc.gnu.org/g:04df5e7de2f3dd652a9cddc1c9adfbdf45947ae6
>
> commit r11-2909-g04df5e7de2f3dd652a9
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82261
--- Comment #4 from Peter Cordes ---
GCC will emit SHLD / SHRD as part of shifting an integer that's two registers
wide.
Hironori Bono proposed the following functions as a workaround for this missed
optimization (https://stackoverflow.com/a/7180
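As a self-contained illustration of the first point (not the workaround
referenced above, which is cut off here): a shift of an integer two registers
wide, the case where GCC does use SHLD/SHRD on x86-64.
/* GCC emits shld as part of this variable-count shift of a two-register-wide
   integer, to move the bits that cross the register boundary. */
unsigned __int128 shift_left(unsigned __int128 x, unsigned n)
{
    return x << n;
}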
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066
--- Comment #5 from Peter Cordes ---
> pextrw requires sse4.1 for mem operands.
You're right! I didn't double-check the asm manual for PEXTRW when writing up
the initial report, and had never realized that PINSRW wasn't symmetric with
it.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105079
Bug ID: 105079
Summary: _mm_storeu_si16 inefficiently uses pextrw to an
integer reg (without SSE4.1)
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: m
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508
--- Comment #17 from Peter Cordes ---
(In reply to Andrew Pinski from comment #16)
> >According to Intel (
> > https://software.intel.com/sites/landingpage/IntrinsicsGuide), there are no
> > alignment requirements for _mm_load_sd, _mm_store_sd an
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84508
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #14 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754
--- Comment #6 from Peter Cordes ---
Looks good to me; thanks for taking care of this quickly. Hopefully we can get
this backported to the GCC11 series to limit the damage for people using these
newish intrinsics. I'd love to recommend them for
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105066
Bug ID: 105066
Summary: GCC thinks pinsrw xmm, mem, 0 requires SSE4.1, not
SSE2? _mm_loadu_si16 bounces through integer reg
Product: gcc
Version: 12.0
Status: UNCONFIRMED
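For context, a minimal usage sketch of the two intrinsics these reports cover
(an invented example, not the PRs' testcases): copying 16 bits between memory
and the low element of a vector, where the load currently bounces through an
integer register and the store goes through pextrw to an integer register.
#include <immintrin.h>
__m128i load16(const void *p)
{
    return _mm_loadu_si16(p);   /* PR 105066 */
}
void store16(void *p, __m128i v)
{
    _mm_storeu_si16(p, v);      /* PR 105079 */
}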
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754
--- Comment #3 from Peter Cordes ---
Wait a minute, the current implementation of _mm_loadu_si32 isn't
strict-aliasing or alignment safe!!! That defeats the purpose for its
existence as something to use instead of _mm_cvtsi32_si128( *(int*)p );
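For reference, a sketch of what a strict-aliasing- and alignment-safe 32-bit
vector load looks like (an illustration of the intended semantics, not GCC's
actual emmintrin.h implementation): go through memcpy rather than
dereferencing an int *.
#include <immintrin.h>
#include <string.h>
/* Illustration only: load 4 bytes from an unaligned, arbitrarily-typed
   pointer into the low vector element without UB; the memcpy into a
   temporary optimizes down to a single load. */
static inline __m128i loadu_si32_safe(const void *p)
{
    int tmp;
    memcpy(&tmp, p, sizeof(tmp));
    return _mm_cvtsi32_si128(tmp);
}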
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99754
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #2 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104773
Bug ID: 104773
Summary: compare with 1 not merged with subtract 1
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
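A hedged guess at the kind of code that summary refers to (assumed shape, not
the PR's testcase): a separate cmp against 1 gets emitted even though the
flags from the sub of 1 could serve the branch.
/* Assumed shape: the x >= 1 test and the x - 1 could share one sub's flags
   instead of using a separate cmp. */
unsigned saturating_dec(unsigned x)
{
    if (x >= 1)
        return x - 1;
    return 0;
}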
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97759
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #14 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
--- Comment #11 from Peter Cordes ---
Also, horizontal byte sums are generally best done with VPSADBW against a zero
vector, even if that means some fiddling to flip to unsigned first and then
undo the bias.
simde_vaddlv_s8:
vpxor   xmm0, xm
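To spell out that technique (an invented sketch, not the code from the PR):
psadbw against zero sums groups of 8 unsigned bytes; for signed bytes, flip to
unsigned by XORing with 0x80 (which adds 128 to each byte) and subtract the
total bias afterwards.
#include <immintrin.h>
#include <stdint.h>
/* Horizontal sum of 16 signed bytes: flip to unsigned, psadbw against zero
   to get two 64-bit partial sums, add the halves, undo the 16*128 bias. */
int32_t hsum_s8(__m128i v)
{
    __m128i flipped = _mm_xor_si128(v, _mm_set1_epi8((char)0x80));
    __m128i sums    = _mm_sad_epu8(flipped, _mm_setzero_si128());
    __m128i hi      = _mm_unpackhi_epi64(sums, sums);
    return _mm_cvtsi128_si32(_mm_add_epi32(sums, hi)) - 16 * 128;
}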
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102494
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #10 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80570
--- Comment #3 from Peter Cordes ---
(In reply to Andrew Pinski from comment #2)
> Even on aarch64:
>
> .L2:
> ldr q0, [x1], 16
> sxtl  v1.2d, v0.2s
> sxtl2 v0.2d, v0.4s
> scvtf v1.2d, v1.2d
> sc
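The asm quoted above looks like a vectorized int-to-double conversion loop; a
minimal sketch of that shape (assumed, since the PR's source isn't shown in
this excerpt):
/* Assumed testcase shape: int32 -> double conversion, which on aarch64
   vectorizes to the quoted ldr q / sxtl / sxtl2 / scvtf sequence. */
void int_to_double(double *dst, const int *src, long n)
{
    for (long i = 0; i < n; i++)
        dst[i] = src[i];
}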
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91103
--- Comment #9 from Peter Cordes ---
Thanks for implementing my idea :)
(In reply to Hongtao.liu from comment #6)
> For elements located above 128bits, it seems always better(?) to use
> valign{d,q}
TL:DR:
I think we should still use vextracti
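For concreteness, one way to reach an element above the low 128 bits with a
single lane-crossing valignd, as discussed (an invented illustration; the
PR's examples aren't shown in this excerpt):
#include <immintrin.h>
/* Rotate element 5 of a 16 x int32 vector down to lane 0 with valignd, then
   read it with vmovd, instead of vextracti32x4 plus an in-lane shuffle. */
int extract_elem5(__m512i v)
{
    __m512i rotated = _mm512_alignr_epi32(v, v, 5);
    return _mm_cvtsi128_si32(_mm512_castsi512_si128(rotated));
}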
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
--- Comment #37 from Peter Cordes ---
Correction: PR82666 is about the cmov on the critical path happening even at
-O2 (with GCC7 and later), not just with -O3 -fno-tree-vectorize.
Anyway, that's related, but probably separate from choosing to do
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=56309
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #36 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #5 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82940
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #6 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922
--- Comment #2 from Peter Cordes ---
Possibly also related:
With different surrounding code, this loop can compile to asm which has two
useless movz / mov register copies in the loop at -O2
(https://godbolt.org/z/PTcqzM6q7). (To set up for en
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100922
Bug ID: 100922
Summary: CSE leads to fully redundant (back to back)
zero-extending loads of the same thing in a loop, or a
register copy
Product: gcc
Version: 12
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88770
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #2 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80636
Peter Cordes changed:
What|Removed |Added
Status|NEW |RESOLVED
Resolution|---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=42587
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #12 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98801
Peter Cordes changed:
What|Removed |Added
CC||peter at cordes dot ca
--- Comment #5 from Peter Cordes ---
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98291
Bug ID: 98291
Summary: multiple scalar FP accumulators auto-vectorize worse
than scalar, including vector load + merge instead of
scalar + high-half insert
Product: gcc
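A hedged sketch of the pattern that summary names (assumed shape; the PR's
testcase isn't included in this excerpt): a reduction hand-unrolled into two
scalar FP accumulators, which the vectorizer then handles worse than the
plain scalar loop.
/* Assumed shape: two scalar accumulators.  Per the summary, the vectorized
   version uses a vector load + merge where a scalar load + high-half insert
   (movhpd-style) would be cheaper. */
double sum_pairs(const double *a, long n)
{
    double s0 = 0.0, s1 = 0.0;
    for (long i = 0; i < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    return s0 + s1;
}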
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
--- Comment #1 from Peter Cordes ---
Forgot to include https://godbolt.org/z/q44r13
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97366
Bug ID: 97366
Summary: [8/9/10/11 Regression] Redundant load with SSE/AVX
vector intrinsics
Product: gcc
Version: 11.0
Status: UNCONFIRMED
Keywords: missed-optimization