https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107389
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with
typedef __uint128_t aligned_type __attribute__((aligned(16)));
_Static_assert(__alignof(aligned_type) == 16);
__uint128_t foo(aligned_type *p)
{
  p = __builtin_assume_aligned (p, 16);
  return __atomic_load_n (p, 0);
}
I see
foo:
.LFB0:
.cfi_startproc
lpq %r4,0(%r3)
stmg %r4,%r5,0(%r2)
br %r14
at -O2, but without the __builtin_assume_aligned call optimization alone doesn't
help much.
And without optimization but with __builtin_assume_aligned in place we simply
leave the call around - we probably should have elided it in
fold-all-builtins and set the alignment on the destination SSA name even
when not optimizing (we already do that there when optimizing), or do the
same during RTL expansion.
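For reference, a minimal sketch (hypothetical names, not from this report) of
the usual __builtin_assume_aligned pattern: the alignment promise attaches to
the pointer the builtin *returns*, so the result must be assigned back, exactly
as foo() above does.

```c
#include <stdint.h>

/* Hypothetical helper: promise the compiler that buf is 16-byte
   aligned, then load the first 64-bit word through the returned
   pointer.  Assigning the builtin's result back is what makes the
   alignment visible to the optimizer; calling it and discarding
   the return value has no effect. */
uint64_t load_first(const void *buf)
{
  const uint64_t *p = __builtin_assume_aligned (buf, 16);
  return *p;
}
```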