https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89967
--- Comment #7 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
Ok, so my current patches actually make things worse, because SRA decides to scalarize in between the builtins. That is because we don't look back over things that don't alias (in this case the clobber):

```
  # .MEM_122 = VDEF <.MEM_121>
  g1b = D.23266;
  # .MEM_123 = VDEF <.MEM_122>
  __b ={v} {CLOBBER(eos)};
  _91 = BIT_FIELD_REF <o1v_39(D), 32, 96>;
  _15 = (sizetype) _91;
  _16 = in_53(D) + _15;
  # .MEM_127 = VDEF <.MEM_123>
  __b = g1b;
  # .MEM_128 = VDEF <.MEM_127>
  D.23259 = __builtin_aarch64_ld2_lanev16qi_usus (_16, g1b, 12);
```

That is with -fstack-reuse=none; we get 4 mov instructions in the non-loop case, while the loop case is still bad (and maybe even worse). Note the 4 mov here is an RTL issue and maybe even a cost issue; I have not fully looked into that. movi should be just as cheap as mov, so I am not sure why we are doing CSE of a constant here.

The loop case is that we have:

```
  a = ...
L1:
  b = ...(a);
  c = ...(b);
  use a.val[0], a.val[1]
  a = b;
  if (...) goto L1;
```

Without the copy propagation, SRA is a little smarter about scalarizing the copies, but afterwards SRA decides it needs to scalarize a, b, and c anyway ("because why not").
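
For reference, a minimal sketch of the loop shape above. This is not the testcase from this PR; the function name, the pointer arithmetic and how the result is consumed are made up, only the vld2q_lane_u8 / loop-carried-copy structure is meant to match the GIMPLE:

```
/* Sketch only: names and pointer math are hypothetical; the point is the
   ld2-lane builtin feeding on the previous aggregate plus the a = b copy.  */
#include <arm_neon.h>

uint8x16_t
loop_shape (const uint8_t *in, uint8x16x2_t a, int n)
{
  uint8x16_t acc = vdupq_n_u8 (0);
  for (int i = 0; i < n; i++)
    {
      /* b = ...(a): vld2q_lane_u8 takes the previous pair as the merge
         operand, like __builtin_aarch64_ld2_lanev16qi_usus above.  */
      uint8x16x2_t b = vld2q_lane_u8 (in + 2 * i, a, 12);
      /* c = ...(b): a second dependent load of the same kind.  */
      uint8x16x2_t c = vld2q_lane_u8 (in + 2 * i + 32, b, 12);
      /* use a.val[0], a.val[1] while the copy a = b is still pending.  */
      acc = veorq_u8 (acc, veorq_u8 (a.val[0], a.val[1]));
      acc = veorq_u8 (acc, c.val[0]);
      /* a = b: the loop-carried aggregate copy that copy-prop and SRA
         interact badly on.  */
      a = b;
    }
  return acc;
}
```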