https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121315
--- Comment #3 from Alex Coplan <acoplan at gcc dot gnu.org> ---
Here is a reduced testcase (compile with -O3 -mcpu=neoverse-v2):

void copyReverseGeneric(int *dst, int *src)
{
  for (int i = 0; i < 10000; ++i)
    dst[i] = __builtin_bswap32(src[i]);
}

of course using LDP/STP here would result in an extra add over the current
codegen (even auto-inc LDP/STP doesn't come for free), but maybe it is
worthwhile. I will look into it.
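
For anyone experimenting with alternative codegen for this loop, a small functional check may be handy. The following is only a sketch and not part of the report: `bswap32_ref` and `check_copy_reverse` are hypothetical helper names introduced here, and the input pattern is arbitrary. It verifies results only and says nothing about whether LDP/STP is actually emitted.

```c
#include <stdint.h>

enum { N = 10000 };  /* matches the trip count in the testcase */

/* The reduced testcase from the comment, unchanged apart from layout. */
void copyReverseGeneric(int *dst, int *src)
{
  for (int i = 0; i < 10000; ++i)
    dst[i] = __builtin_bswap32(src[i]);
}

/* Hypothetical reference implementation: byte swap written with shifts,
   so it does not depend on the builtin being checked.  */
static uint32_t bswap32_ref(uint32_t x)
{
  return (x >> 24)
         | ((x >> 8) & 0x0000ff00u)
         | ((x << 8) & 0x00ff0000u)
         | (x << 24);
}

/* Hypothetical helper: returns 1 if every element written by
   copyReverseGeneric matches the reference byte swap, 0 otherwise.  */
int check_copy_reverse(void)
{
  static int src[N], dst[N];

  /* Arbitrary small input values; no signed overflow (max is 70001). */
  for (int i = 0; i < N; ++i)
    src[i] = i * 7 + 1;

  copyReverseGeneric(dst, src);

  for (int i = 0; i < N; ++i)
    if ((uint32_t)dst[i] != bswap32_ref((uint32_t)src[i]))
      return 0;
  return 1;
}
```

Calling `check_copy_reverse()` from a test driver built with the same flags (-O3 -mcpu=neoverse-v2) should return 1 both before and after any change to the load/store pairing.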