https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115438

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Last reconfirmed|2024-07-03 00:00:00         |2024-11-25

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
Re-confirmed.  On Zen4 the difference is in bi_cgstab_block_:

Overhead       Samples  Command          Shared Object           Symbol
  29.22%        245629  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] mat_times_vec_
  27.94%        234346  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] mat_times_vec_
   9.47%         79765  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] bi_cgstab_block_
   7.69%         64082  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] shell_
   7.50%         62790  bwaves_r_peak.g  bwaves_r_peak.gcc7-m64  [.] shell_
   6.35%         53329  bwaves_r_base.g  bwaves_r_base.gcc7-m64  [.] bi_cgstab_block_

C       Compute local rhat * v norm
                tmp=0.
                do k=2,nzl+1
                   do j=1,ny
                      do i=1,nx
                         do l=1,nb
                            tmp=tmp+rhat(l,i,j,k)*v(l,i,j,k)
                         enddo
                      enddo
                   enddo
                enddo

where with GCC 14.2 we end up scheduling the AVX vector epilogue together
with the following scalar epilogue, while trunk fails to do this.  The SSE
epilogue isn't created on trunk as it is deemed unprofitable.

This might in the end be fallout of different sinking?!

One difference between SLP and non-SLP is that with SLP we use the incoming
value of tmp as the initial value of the first lane, while with non-SLP we
use zero as the initial reduction value and compensate in the epilogue:

  _1615 = {tmp_111, 0.0, 0.0, 0.0};
  # _1619 = PHI <_1618(116), _1615(119)>
...
  _1623 = .REDUC_PLUS (vect_tmp_1505.835_1621);

vs.

  # _1346 = PHI <_1345(98), { 0.0, 0.0, 0.0, 0.0 }(94)>
...
  _1385 = .REDUC_PLUS (vect_tmp_1268.744_1383);
  _1386 = tmp_710 + _1385;
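
To make the two schemes easier to compare, here is a minimal scalar C sketch
of what the two dumps above do.  It is an illustration only, not vectorizer
output: the function names, the LANES constant, the use of double and the
requirement that n is a multiple of LANES are all assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4  /* assumed vector width, matching the 4-lane constants in the dumps */

/* Horizontal add of all lanes; stands in for .REDUC_PLUS. */
static double reduc_plus(const double acc[LANES])
{
    double s = 0.0;
    for (int l = 0; l < LANES; l++)
        s += acc[l];
    return s;
}

/* SLP-style: the incoming scalar seeds lane 0, i.e. {init, 0.0, 0.0, 0.0}. */
double dot_init_in_lane0(double init, const double *a, const double *b, size_t n)
{
    assert(n % LANES == 0);  /* assumption: no scalar tail is handled here */
    double acc[LANES] = { init, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += a[i + l] * b[i + l];
    return reduc_plus(acc);
}

/* Non-SLP-style: zero accumulator, the incoming scalar is added after the
   reduction (the extra scalar add in the epilogue). */
double dot_init_after(double init, const double *a, const double *b, size_t n)
{
    assert(n % LANES == 0);  /* assumption: no scalar tail is handled here */
    double acc[LANES] = { 0.0, 0.0, 0.0, 0.0 };
    for (size_t i = 0; i < n; i += LANES)
        for (int l = 0; l < LANES; l++)
            acc[l] += a[i + l] * b[i + l];
    return init + reduc_plus(acc);
}
```

Both functions compute the same sum mathematically; they differ in where the
incoming value enters the chain of additions, so with floating point the
rounding can differ, and the second form carries one extra scalar add after
the reduction.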

so while the profile clearly shows a difference between GCC 14.2 and trunk,
I can't yet pinpoint what makes the difference.

The same can be seen for the other similar loops in this function.
