https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119108
--- Comment #6 from Tamar Christina <tnfchris at gcc dot gnu.org> ---
Ok, now really confirmed :)

Interestingly the behavior on other uarches suggests this may be cost modelling.
On Neoverse-V1 we get (without LTO):

BM_UFlat/0/1          -4.60251
BM_UFlat/0/2          -2.34742
BM_UFlat/3/1           4
BM_UFlat/3/2           4.21053
BM_UFlat/10/1         -5.71429
BM_UFlat/10/2         -4.25532
BM_UValidate/1/2      -2.26415
BM_UValidate/2/1      -6.55738
BM_UValidate/2/2      -5.78512
BM_UValidate/3/1      -7.14286
BM_UValidate/3/2      -7.5
BM_UIOVecSource/0/2    6.93069
BM_UIOVecSource/1/2    2.80549
BM_UIOVecSource/3/2    2.03488
BM_UIOVecSource/5/2    4.05983
BM_UIOVecSource/6/2    2.52427
BM_UIOVecSource/7/2    3.31858
BM_UIOVecSource/8/2    3.06486
BM_UIOVecSource/9/2    2.66458
BM_UIOVecSource/10/2   6.66667
BM_UIOVecSource/11/2   3.9801
BM_UIOVecSink/0       -4.9062
BM_UFlatSink/0/1      -2.08333
BM_UFlatSink/0/2      -5.09259
BM_UFlatSink/3/2      -2
BM_UFlatSink/10/1     -4.24528
BM_UFlatSink/10/2     -5.26316

So gains and losses.

That said.. the code generated is inefficient. Let me address that first, I
have a patch for some of these that I never finished.