https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107389
--- Comment #2 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with

typedef __uint128_t aligned_type __attribute__((aligned(16)));
_Static_assert(__alignof(aligned_type) == 16);

__uint128_t foo(aligned_type *p)
{
  p = __builtin_assume_aligned (p, 16);
  return __atomic_load_n(p, 0);
}

I see

foo:
.LFB0:
        .cfi_startproc
        lpq     %r4,0(%r3)
        stmg    %r4,%r5,0(%r2)
        br      %r14

at -O2, but without __builtin_assume_aligned, optimization doesn't help much.
And without optimization but with __builtin_assume_aligned in place we simply
leave it around - we probably should have elided it in fold-all-builtins and
set the alignment on the destination SSA name also when not optimizing (we do
that there when optimizing), or do the same during RTL expansion.