https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2024-07-03 00:00:00         |2024-11-25

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.  On Zen4 it's bi_cgstab_block_

Overhead  Samples  Command          Shared Object           Symbol
  29.22%   245629  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
  27.94%   234346  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
   9.47%    79765  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   7.69%    64082  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   7.50%    62790  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   6.35%    53329  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_

C     Compute local rhat * v norm
      tmp=0.
      do k=2,nzl+1
         do j=1,ny
            do i=1,nx
               do l=1,nb
                  tmp=tmp+rhat(l,i,j,k)*v(l,i,j,k)
               enddo
            enddo
         enddo
      enddo

where with GCC 14.2 we end up scheduling the AVX vector epilogue with the
following scalar epilogue while trunk fails to do this.  The SSE epilogue
isn't created on trunk as it is deemed unprofitable.  This might in the end
be fallout of different sinking?!

One difference between SLP and non-SLP is that with SLP we use the incoming
value as the initial reduction value, while with non-SLP we use zero as the
initial reduction value and compensate in the epilogue:

  _1615 = {tmp_111, 0.0, 0.0, 0.0};
  # _1619 = PHI <_1618(116), _1615(119)>
...
  _1623 = .REDUC_PLUS (vect_tmp_1505.835_1621);

vs.

  # _1346 = PHI <_1345(98), { 0.0, 0.0, 0.0, 0.0 }(94)>
...
  _1385 = .REDUC_PLUS (vect_tmp_1268.744_1383);
  _1386 = tmp_710 + _1385;

so while the profile clearly shows a difference between GCC 14.2 and trunk,
I can't yet pinpoint what makes the difference.  The same can be seen for the
other similar loops in this function.
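
For readers less familiar with the two reduction schemes contrasted in the
dumps, the following is a minimal sketch in plain C.  It is not vectorizer
output.  The names dot_init_in_lane0, dot_zero_init and reduc_plus, and the
fixed vectorization factor of 4, are assumptions made for illustration only;
the array acc stands in for a vector register and reduc_plus for .REDUC_PLUS.

```c
#include <assert.h>
#include <stddef.h>

#define VF 4  /* assumed vectorization factor: 4 doubles per vector */

/* Horizontal add of all lanes, standing in for .REDUC_PLUS. */
static double reduc_plus(const double v[VF])
{
    double s = 0.0;
    for (int l = 0; l < VF; l++)
        s += v[l];
    return s;
}

/* Scheme of the first dump: the incoming scalar is placed in lane 0 of
   the initial accumulator, so no compensation is needed after the loop. */
double dot_init_in_lane0(double tmp, const double *a, const double *b,
                         size_t n)
{
    assert(n % VF == 0);  /* sketch only: no scalar epilogue handled */
    double acc[VF] = { tmp, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < n; i += VF)
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l] * b[i + l];
    return reduc_plus(acc);
}

/* Scheme of the second dump: the accumulator starts at zero and the
   incoming scalar is added once, after the horizontal reduction. */
double dot_zero_init(double tmp, const double *a, const double *b,
                     size_t n)
{
    assert(n % VF == 0);  /* sketch only: no scalar epilogue handled */
    double acc[VF] = { 0.0, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < n; i += VF)
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l] * b[i + l];
    return tmp + reduc_plus(acc);
}
```

Both functions compute tmp plus the dot product of a and b.  With general
floating-point inputs the results may differ in the last bits, because tmp
enters the summation at a different point in each scheme.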