[Bug tree-optimization/115438] [15 Regression] 503.bwaves_r regressed 5-11% on different x86_64 machines at -Ofast -march=native since r15-1006-gd93353e6423eca

rguenth at gcc dot gnu.org via Gcc-bugs Mon, 25 Nov 2024 06:52:24 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2024-07-03 00:00:00         |2024-11-25

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.  On Zen4 the difference shows up in bi_cgstab_block_:

Overhead       Samples  Command          Shared Object           Symbol
  29.22%        245629  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
  27.94%        234346  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
   9.47%         79765  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   7.69%         64082  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   7.50%         62790  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   6.35%         53329  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_

C       Compute local rhat * v norm
                tmp=0.
                do k=2,nzl+1
                   do j=1,ny
                      do i=1,nx
                         do l=1,nb
                            tmp=tmp+rhat(l,i,j,k)*v(l,i,j,k)
                         enddo
                      enddo
                   enddo
                enddo

where with GCC 14.2 we end up scheduling the AVX vector epilogue with the
following scalar epilogue, while trunk fails to do this.  The SSE epilogue
isn't created on trunk as it is deemed unprofitable.

This might in the end be fallout of different sinking?!

One difference between SLP and non-SLP is that with SLP we take the scalar
initial value as the initial value of the vector reduction, while with
non-SLP we use zero as the initial reduction value and compensate in the
epilogue:

  _1615 = {tmp_111, 0.0, 0.0, 0.0};
  # _1619 = PHI <_1618(116), _1615(119)>
...
  _1623 = .REDUC_PLUS (vect_tmp_1505.835_1621);

vs.

  # _1346 = PHI <_1345(98), { 0.0, 0.0, 0.0, 0.0 }(94)>
...
  _1385 = .REDUC_PLUS (vect_tmp_1268.744_1383);
  _1386 = tmp_710 + _1385;

so while the profile clearly shows a difference between GCC 14.2 and trunk
I can't yet pinpoint what makes the difference.

The same can be seen for the other similar loops in this function.
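
For reference, the two ways of handling the initial reduction value shown
in the dumps above can be modelled in plain C.  This is only a rough
sketch, not what the vectorizer emits: the function names dot_seeded and
dot_zero_init are made up for illustration, the vectorization factor of 4
is assumed from the 4-lane constants in the dumps, and the lane order of
the final sum is arbitrary (as it may be for .REDUC_PLUS under -Ofast).

```c
#include <assert.h>

#define VF 4  /* assumed vectorization factor (4 lanes, as in the dumps) */

/* Models the {tmp_111, 0.0, 0.0, 0.0} variant: the incoming scalar
   seeds lane 0 of the vector accumulator.  */
double dot_seeded(double init, const double *a, const double *b, int n)
{
    assert(n >= 0);
    double acc[VF] = { init, 0.0, 0.0, 0.0 };
    int i = 0;
    for (; i + VF <= n; i += VF)          /* "vector" loop body */
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l] * b[i + l];
    double sum = acc[0];                  /* stands in for .REDUC_PLUS */
    for (int l = 1; l < VF; l++)
        sum += acc[l];
    for (; i < n; i++)                    /* scalar epilogue */
        sum += a[i] * b[i];
    return sum;
}

/* Models the { 0.0, 0.0, 0.0, 0.0 } variant: the accumulator starts at
   zero and the incoming scalar is added after the reduction, like
   _1386 = tmp_710 + _1385.  */
double dot_zero_init(double init, const double *a, const double *b, int n)
{
    assert(n >= 0);
    double acc[VF] = { 0.0, 0.0, 0.0, 0.0 };
    int i = 0;
    for (; i + VF <= n; i += VF)          /* "vector" loop body */
        for (int l = 0; l < VF; l++)
            acc[l] += a[i + l] * b[i + l];
    double red = acc[0];                  /* stands in for .REDUC_PLUS */
    for (int l = 1; l < VF; l++)
        red += acc[l];
    double sum = init + red;              /* compensation after the loop */
    for (; i < n; i++)                    /* scalar epilogue */
        sum += a[i] * b[i];
    return sum;
}
```

Both functions compute the same value up to floating-point reassociation;
the only difference is the one extra scalar add after the reduction in the
zero-initialized variant.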
