https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
Martin Jambor <jamborm at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2025-04-08
     Ever confirmed|0                           |1

--- Comment #5 from Martin Jambor <jamborm at gcc dot gnu.org> ---
This is likely a store-to-load-forwarding stall issue. I have tracked it
down to two SLP1 transformations - dumps below.

Even though MeanShiftImage receives 10% more samples, its assembly is the
same, and the transformations slowing things down happen in
GetVirtualPixelsFromNexus (which is called from the hottest loop of
MeanShiftImage).

I have verified that the same slow-down happens with -flto-partition=one
and then used option -fdbg-cnt=vect_slp:1-1043:1046-100000 (using unpatched
master revision 337b9ff4854) to avoid the two SLP vectorizations and
confirmed that the run-time performance was restored.

The two SLP1 transformations are reported to take place at
magick/cache.c:2573 but that is where the function
GetVirtualPixelsFromNexus begins. Looking at dumps with -lineno, the
vectorized statements are followed immediately by one from
magick/random.c:629 (load of the normalize field at the end of function
GetPseudoRandomValue).

The SLP1 transformations are:

--- /home/mjambor/gcc/benchmarks/cpu2017/benchspec/CPU/538.imagick_r/build/build_peak_trunk-lto-nat-m64.0000/gvpfn/imagick_r.ltrans1.ltrans.188t.slp1 2025-04-07 23:28:47.568570350 +0200
+++ gvpfn/imagick_r.ltrans1.ltrans.188t.slp1 2025-04-07 23:28:50.673570142 +0200
@@ -3897,54 +3899,10 @@
 magick/cache.c:2699:7: note: created &virtual_pixel
 magick/cache.c:2699:7: note: add new stmt: MEM <vector(4) short unsigned int> [(short unsigned int *)&virtual_pixel] = { 0, 0, 0, 65535 };
 magick/cache.c:2699:7: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648860 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
-magick/cache.c:2573:33: note: stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
-magick/cache.c:2573:33: note: stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
-magick/cache.c:2573:33: note: children 0x3648a10
-magick/cache.c:2573:33: note: node (external) 0x3648a10 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: { _929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.193_931 = PHI <D__lsm.193_32(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.194_932 = PHI <D__lsm.194_33(80)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.195_930 = PHI <D__lsm.195_474(80)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.214_881
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
-magick/cache.c:2573:33: optimized: basic block part vectorized using 32 byte vectors
-magick/cache.c:2573:33: note: Vectorizing SLP tree:
-magick/cache.c:2573:33: note: node 0x3648aa0 (max_nunits=4, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: op template: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: stmt 0 MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: stmt 1 MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
-magick/cache.c:2573:33: note: stmt 2 MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
-magick/cache.c:2573:33: note: stmt 3 MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
-magick/cache.c:2573:33: note: children 0x3648c50
-magick/cache.c:2573:33: note: node (external) 0x3648c50 (max_nunits=1, refcnt=1) vector(4) long unsigned int
-magick/cache.c:2573:33: note: { _552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342 }
-magick/cache.c:2573:33: note: ------>vectorizing SLP node starting from: MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.189_606 = PHI <D__lsm.189_574(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.190_568 = PHI <D__lsm.190_555(82)>, type of def: internal
-magick/cache.c:2573:33: note: vect_is_simple_use: operand D__lsm.191_342 = PHI <D__lsm.191_106(82)>, type of def: internal
-magick/cache.c:2573:33: note: transform store. ncopies = 1
-magick/cache.c:2573:33: note: create vector_type-pointer variable to type: vector(4) long unsigned int vectorizing a pointer ref: MEM[(long unsigned int *)prephitmp_900 + 32B]
-magick/cache.c:2573:33: note: created vectp.216_874
-magick/cache.c:2573:33: note: add new stmt: MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
-magick/cache.c:2573:33: note: vectorizing stmts using SLP.
 magick/cache.c:2573:33: note: ***** The result for vector mode V64QI would be the same
 __attribute__((visibility ("default"), hot))
 const struct PixelPacket * GetVirtualPixelsFromNexus (const struct Image * image, const VirtualPixelMethod virtual_pixel_method, const ssize_t x, const ssize_t y, const size_t columns, const size_t rows, struct NexusInfo * nexus_info, struct ExceptionInfo * exception)
 {
-  long unsigned int * vectp.216;
-  vector(4) long unsigned int * vectp_prephitmp.215;
-  long unsigned int * vectp.214;
-  vector(4) long unsigned int * vectp_prephitmp.213;
   Quantum * vectp.212;
   vector(4) short unsigned int * vectp_virtual_pixel.211;
   Quantum * vectp.210;
@@ -4306,8 +4264,6 @@
   long unsigned int pretmp_875;
   long int _877;
   long int prephitmp_878;
-  vector(4) long unsigned int _879;
-  vector(4) long unsigned int _882;
   long unsigned int pretmp_889;
   long int _891;
   long int prephitmp_892;
@@ -5354,9 +5310,10 @@
   # D__lsm.193_931 = PHI <D__lsm.193_32(80)>
   # D__lsm.195_930 = PHI <D__lsm.195_474(80)>
   # _929 = PHI <_543(80)>
-  _882 = {_929, D__lsm.193_931, D__lsm.194_932, D__lsm.195_930};
-  vectp.214_881 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.214_881] = _882;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.193_931;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.194_932;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.195_930;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _929;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:21] _544 = [magick/random.c:629:21] MEM[(struct RandomInfo *)prephitmp_900].normalize;
   [magick/random.c:629:32] _545 = (double) _929;
@@ -5413,9 +5370,10 @@
   # D__lsm.191_342 = PHI <D__lsm.191_106(82)>
   # D__lsm.190_568 = PHI <D__lsm.190_555(82)>
   # D__lsm.189_606 = PHI <D__lsm.189_574(82)>
-  _879 = {_552, D__lsm.189_606, D__lsm.190_568, D__lsm.191_342};
-  vectp.216_874 = prephitmp_900 + 32;
-  MEM <vector(4) long unsigned int> [(long unsigned int *)vectp.216_874] = _879;
+  MEM[(long unsigned int *)prephitmp_900 + 40B] = D__lsm.189_606;
+  MEM[(long unsigned int *)prephitmp_900 + 48B] = D__lsm.190_568;
+  MEM[(long unsigned int *)prephitmp_900 + 56B] = D__lsm.191_342;
+  MEM[(long unsigned int *)prephitmp_900 + 32B] = _552;
   [magick/random.c:629:3] # DEBUG BEGIN_STMT
   [magick/random.c:629:32] _531 = (double) _552;
   _502 = _230 * _544;
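
To make the access pattern in the dumps easier to picture, here is a minimal C sketch. It is an illustration under assumptions, not the ImageMagick code: `struct rng_state` is a hypothetical stand-in for `struct RandomInfo` (only the byte offsets 32 to 56 of the four 64-bit words are taken from the dump), and `rng_next` is a made-up xorshift-style update. Each call loads the four words as scalars and writes them back with four scalar stores, in the same order as the un-vectorized GIMPLE. If SLP merges those stores into one 32-byte vector store, the scalar loads at the start of the next call have to be served from that wider store, which is the kind of situation where store-to-load forwarding may stall on some microarchitectures. Whether it actually does is hardware-dependent, and the sketch does not prove it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout, NOT the real struct RandomInfo from magick/random.c.
   Only the offsets of the four state words (+32, +40, +48, +56) mirror the dump. */
struct rng_state {
    double normalize;   /* placeholder scale factor, offset chosen arbitrarily */
    uint64_t pad[3];    /* placeholder padding so that seed[] starts at +32 */
    uint64_t seed[4];   /* four adjacent 64-bit words, 32 bytes in total */
};

static_assert(offsetof(struct rng_state, seed) == 32,
              "state words are expected at byte offset 32");

/* Made-up xorshift-style step (an assumption for illustration only).
   The four scalar loads below are what may have to be forwarded from the
   stores at the end of the previous call. */
double rng_next(struct rng_state *s)
{
    uint64_t s0 = s->seed[0];
    uint64_t s1 = s->seed[1];
    uint64_t s2 = s->seed[2];
    uint64_t s3 = s->seed[3];

    uint64_t t = s1 ^ (s1 << 11);
    uint64_t n0 = (s0 ^ (s0 >> 19)) ^ (t ^ (t >> 8));

    /* Four adjacent 8-byte stores: a candidate for basic-block SLP,
       which could replace them with a single 32-byte vector store. */
    s->seed[1] = s2;
    s->seed[2] = s3;
    s->seed[3] = s0;
    s->seed[0] = n0;

    return s->normalize * (double) n0;
}
```

One way to probe the effect would be to call `rng_next` in a tight loop and compare timings of builds with and without basic-block SLP (for example with `-fno-tree-slp-vectorize`); any difference observed that way would only be suggestive for the benchmark itself.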