https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298

Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2025-04-08
     Ever confirmed|0                           |1

--- Comment #5 from Martin Jambor <jamborm at gcc dot gnu.org> ---
This is likely a store-to-load-forwarding stall issue.  I have tracked
it down to two SLP1 transformations - dumps below.

Even though MeanShiftImage receives 10% more samples, its assembly is
the same; the transformations that slow things down happen in
GetVirtualPixelsFromNexus (which is called from the hottest loop
of MeanShiftImage).

I have verified that the same slow-down happens with
-flto-partition=one.  I then used option
-fdbg-cnt=vect_slp:1-1043:1046-100000 (using unpatched master revision
337b9ff4854) to skip the two SLP vectorizations and confirmed that
this restores the run-time performance.

The two SLP1 transformations are reported to take place at
magick/cache.c:2573 but that is where the function
GetVirtualPixelsFromNexus begins.  Looking at dumps with -lineno, the
vectorized statements are followed immediately by one from
magick/random.c:629 (load of the normalize field at the end of
function GetPseudoRandomValue).
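
For readers who do not have the benchmark sources at hand, the shape of
the affected code can be sketched roughly as follows.  This is only an
illustration, not the ImageMagick source: struct toy_random_info,
toy_next, the padding and the update formula are all made up.  Only the
byte offsets 32..56 of the four 64-bit stores and the use of a
normalize field right after them are taken from the dumps below.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Hypothetical stand-in for struct RandomInfo.  The real layout is not
   shown in this report; pad[] only exists so that seed[] starts at
   byte 32, matching the "+ 32B" .. "+ 56B" stores in the dump.  */
struct toy_random_info {
  unsigned char pad[32];
  uint64_t seed[4];     /* bytes 32, 40, 48, 56 */
  double normalize;     /* read right after the stores */
};

static_assert(offsetof(struct toy_random_info, seed) == 32,
              "seed[] is assumed to start at byte 32");

/* Toy xorshift-style step (an assumption, not the benchmark's code).  */
double toy_next(struct toy_random_info *r)
{
  uint64_t s0 = r->seed[0], s1 = r->seed[1];
  uint64_t s2 = r->seed[2], s3 = r->seed[3];
  uint64_t t = s1 ^ (s1 << 11);
  uint64_t n0 = (s0 ^ (s0 >> 19)) ^ (t ^ (t >> 8));

  /* Four adjacent 64-bit stores.  A group like this is what
     basic-block SLP may merge into a single 32-byte vector store.  */
  r->seed[1] = s2;
  r->seed[2] = s3;
  r->seed[3] = s0;
  r->seed[0] = n0;

  return r->normalize * (double) n0;
}
```

If the diagnosis above is right, the point of interest is that the next
call reads seed[0..3] back with scalar loads shortly after they were
written, so whether they were written by four scalar stores or by one
vector store built from scalars can matter for store forwarding.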

The SLP1 transformations are:

---
/home/mjambor/gcc/benchmarks/cpu2017/benchspec/CPU/538.imagick_r/build/build_peak_trunk-lto-nat-m64.0000/gvpfn/imagick_r.ltrans1.ltrans.188t.slp1
  2025-04-07 23:28:47.568570350 +0200
+++ gvpfn/imagick_r.ltrans1.ltrans.188t.slp1    2025-04-07 23:28:50.673570142
+0200
@@ -3897,54 +3899,10 @@
 magick/cache.c:2699:7: note: created &virtual_pixel
 magick/cache.c:2699:7: note: add new stmt: MEM <vector(4) short unsigned int>
[(short unsigned int *)&virtual_pixel] = { 0, 0, 0, 65535 };
 magick/cache.c:2699:7: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte
vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648860 (max_nunits=4, refcnt=1)
vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int
*)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note:  stmt 0 MEM[(long unsigned int *)prephitmp_900 +
32B] = _929;
-magick/cache.c:2573:33: note:  stmt 1 MEM[(long unsigned int *)prephitmp_900 +
40B] = D__lsm.193_931;
-magick/cache.c:2573:33: note:  stmt 2 MEM[(long unsigned int *)prephitmp_900 +
48B] = D__lsm.194_932;
-magick/cache.c:2573:33: note:  stmt 3 MEM[(long unsigned int *)prephitmp_900 +
56B] = D__lsm.195_930;
-magick/cache.c:2573:33: note:  children 0x3648a10
-magick/cache.c:2573:33: note: node (external) 0x3648a10 (max_nunits=1,
refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note:  { _929, D__lsm.193_931, D__lsm.194_932,
D__lsm.195_930 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from:
MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.193_931 = PHI
<D__lsm.193_32(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.194_932 = PHI
<D__lsm.194_33(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.195_930 = PHI
<D__lsm.195_474(80)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type:
vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int
*)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.214_881
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int>
[(long unsigned int *)vectp.214_881] = _882;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte
vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648aa0 (max_nunits=4, refcnt=1)
vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int
*)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note:  stmt 0 MEM[(long unsigned int *)prephitmp_900 +
32B] = _552;
-magick/cache.c:2573:33: note:  stmt 1 MEM[(long unsigned int *)prephitmp_900 +
40B] = D__lsm.189_606;
-magick/cache.c:2573:33: note:  stmt 2 MEM[(long unsigned int *)prephitmp_900 +
48B] = D__lsm.190_568;
-magick/cache.c:2573:33: note:  stmt 3 MEM[(long unsigned int *)prephitmp_900 +
56B] = D__lsm.191_342;
-magick/cache.c:2573:33: note:  children 0x3648c50
-magick/cache.c:2573:33: note: node (external) 0x3648c50 (max_nunits=1,
refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note:  { _552, D__lsm.189_606, D__lsm.190_568,
D__lsm.191_342 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from:
MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.189_606 = PHI
<D__lsm.189_574(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.190_568 = PHI
<D__lsm.190_555(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.191_342 = PHI
<D__lsm.191_106(82)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type:
vector(4) long unsigned int  vectorizing a pointer ref: MEM[(long unsigned int
*)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.216_874
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int>
[(long unsigned int *)vectp.216_874] = _879;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
 magick/cache.c:2573:33: note: ***** The result for vector mode V64QI would be
the same
 __attribute__((visibility ("default"), hot))
 const struct PixelPacket * GetVirtualPixelsFromNexus (const struct Image *
image, const VirtualPixelMethod virtual_pixel_method, const ssize_t x, const
ssize_t y, const size_t columns, const size_t rows, struct NexusInfo *
nexus_info, struct ExceptionInfo * exception)
 {
-  long unsigned int * vectp.216;
-  vector(4) long unsigned int * vectp_prephitmp.215;
-  long unsigned int * vectp.214;
-  vector(4) long unsigned int * vectp_prephitmp.213;
   Quantum * vectp.212;
   vector(4) short unsigned int * vectp_virtual_pixel.211;
   Quantum * vectp.210;
@@ -4306,8 +4264,6 @@
   long unsigned int pretmp_875;
   long int _877;
   long int prephitmp_878;
-  vector(4) long unsigned int _879;
-  vector(4) long unsigned int _882;
   long unsigned int pretmp_889;
   long int _891;
   long int prephitmp_892;
@@ -5354,9 +5310,10 @@
   # D__lsm.193_931 = PHI <D__lsm.193_32(80)>
   # D__lsm.195_930 = PHI <D__lsm.195_474(80)>
   # _929 = PHI <_543(80)>
-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] =
_882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:21] _544 = [magick/random.c:629:21] MEM[(struct
RandomInfo *)prephitmp_900].normalize;
   [magick/random.c:629:32] _545 = (double) _929;
@@ -5413,9 +5370,10 @@
   # D__lsm.191_342 = PHI <D__lsm.191_106(82)>
   # D__lsm.190_568 = PHI <D__lsm.190_555(82)>
   # D__lsm.189_606 = PHI <D__lsm.189_574(82)>
-  _879 = {_552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342};
-  vectp.216_874 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] =
_879;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:32] _531 = (double) _552;
   _502 = _230 * _544;
