https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119919
Jan Hubicka <hubicka at gcc dot gnu.org> changed:
What |Removed |Added
----------------------------------------------------------------------------
Ever confirmed|0 |1
Status|UNCONFIRMED |ASSIGNED
Last reconfirmed| |2025-04-24
Assignee|unassigned at gcc dot gnu.org |hubicka at gcc dot gnu.org
--- Comment #2 from Jan Hubicka <hubicka at gcc dot gnu.org> ---
This happens with -O2 only. The difference in the optimization report is:
+++ bbb 2025-04-24 16:21:25.029155295 +0200
@@ -108,10 +108,7 @@
exchange2.fppized.f90:1027:58: optimized: loop vectorized using 16 byte vectors
exchange2.fppized.f90:1019:71: optimized: loop vectorized using 8 byte vectors
exchange2.fppized.f90:1016:55: optimized: loop vectorized using 16 byte vectors
-exchange2.fppized.f90:1003:32: optimized: loop vectorized using 8 byte vectors
exchange2.fppized.f90:1123:83: optimized: loop with 1 iterations completely unrolled (header execution count 119292720)
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
-exchange2.fppized.f90:1003:32: optimized: loop turned into non-loop; it never loops
exchange2.fppized.f90:1203:51: optimized: loop unrolled 1 times
exchange2.fppized.f90:1194:54: optimized: loop unrolled 1 times
exchange2.fppized.f90:1185:57: optimized: loop unrolled 1 times
Before the patch we get:
*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 4 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body
patt_2785 1 times vector_store costs 12 in body
exchange2.fppized.f90:1003:32: note: Cost model analysis:
Vector inside of loop cost: 168
Vector prologue cost: 36
Vector epilogue cost: 28
Scalar iteration cost: 28
Scalar outside cost: 0
Vector outside cost: 64
prologue iterations: 0
epilogue iterations: 1
Calculated minimum iters for profitability: 7
<bb 3> [local count: 6974165]:
# _1815 = PHI <_13(294), 1(2)>
# _905 = PHI <_12(294), 0(2)>
# ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
# ivtmp_2808 = PHI <ivtmp_2809(294), _2807(2)>
# vectp_temp.3211_2840 = PHI <vectp_temp.3211_2841(294), &temp.862(2)>
# ivtmp_2843 = PHI <ivtmp_2844(294), 0(2)>
_5 = _1815 * 9;
_6 = _3 + _5;
_2810 = MEM[(int *)ivtmp_2808];
ivtmp_2811 = ivtmp_2808 + 36;
_2812 = MEM[(int *)ivtmp_2811];
ivtmp_2813 = ivtmp_2811 + 36;
vect_cst__2814 = {_2810, _2812};
_2815 = MEM[(int *)ivtmp_2813];
ivtmp_2816 = ivtmp_2813 + 36;
_2817 = MEM[(int *)ivtmp_2816];
ivtmp_2818 = ivtmp_2816 + 36;
vect_cst__2819 = {_2815, _2817};
_2820 = MEM[(int *)ivtmp_2818];
ivtmp_2821 = ivtmp_2818 + 36;
_2822 = MEM[(int *)ivtmp_2821];
ivtmp_2823 = ivtmp_2821 + 36;
vect_cst__2824 = {_2820, _2822};
_2825 = MEM[(int *)ivtmp_2823];
ivtmp_2826 = ivtmp_2823 + 36;
_2827 = MEM[(int *)ivtmp_2826];
vect_cst__2828 = {_2825, _2827};
mask__8.3207_2829 = { 0, 0 } != vect_cst__2814;
mask__8.3207_2830 = { 0, 0 } != vect_cst__2819;
mask__8.3207_2831 = { 0, 0 } != vect_cst__2824;
mask__8.3207_2832 = { 0, 0 } != vect_cst__2828;
vect_patt_2784.3208_2833 = VEC_COND_EXPR <mask__8.3207_2829, { 1, 1 }, { 0, 0 }>;
vect_patt_2784.3208_2834 = VEC_COND_EXPR <mask__8.3207_2830, { 1, 1 }, { 0, 0 }>;
vect_patt_2784.3208_2835 = VEC_COND_EXPR <mask__8.3207_2831, { 1, 1 }, { 0, 0 }>;
vect_patt_2784.3208_2836 = VEC_COND_EXPR <mask__8.3207_2832, { 1, 1 }, { 0, 0 }>;
vect_patt_2785.3210_2837 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2833, vect_patt_2784.3208_2834>;
vect_patt_2785.3210_2838 = VEC_PACK_TRUNC_EXPR <vect_patt_2784.3208_2835, vect_patt_2784.3208_2836>;
vect_patt_2785.3209_2839 = VEC_PACK_TRUNC_EXPR <vect_patt_2785.3210_2837, vect_patt_2785.3210_2838>;
_7 = sudoku1[_6];
_8 = _7 != 0;
_10 = (sizetype) _905;
_11 = &temp.862 + _10;
MEM <vector(8) unsigned char> [(logical(kind=1) *)vectp_temp.3211_2840] = vect_patt_2785.3209_2839;
_12 = _905 + 1;
_13 = _1815 + 1;
ivtmp_1875 = ivtmp_1876 - 1;
ivtmp_2809 = ivtmp_2808 + 288;
vectp_temp.3211_2841 = vectp_temp.3211_2840 + 8;
ivtmp_2844 = ivtmp_2843 + 1;
if (ivtmp_2844 >= 1)
goto <bb 580>; [100.00%]
else
goto <bb 294>; [0.00%]
After the patch we get:
*_45 1 times scalar_load costs 12 in prologue
u[_47] 1 times scalar_load costs 12 in prologue
_46 ? _ifc__1856 : 9 1 times scalar_stmt costs 8 in prologue
_ifc__1854 1 times scalar_store costs 12 in prologue
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times scalar_load costs 12 in body
sudoku1[_6] 1 times vec_construct costs 4 in body
_7 != 0 4 times vector_stmt costs 16 in body
<unknown> 1 times vector_load costs 12 in prologue
_8 ? 1 : 0 4 times vector_stmt costs 64 in body
<unknown> 1 times vector_load costs 12 in prologue
<unknown> 1 times vector_load costs 12 in prologue
(unsigned char) patt_2784 1 times vec_promote_demote costs 4 in body
(unsigned char) patt_2784 2 times vec_promote_demote costs 8 in body
Vector inside of loop cost: 216
Vector prologue cost: 36
Vector epilogue cost: 28
Scalar iteration cost: 28
Scalar outside cost: 0
Vector outside cost: 64
prologue iterations: 0
epilogue iterations: 1
<bb 3> [local count: 62767486]:
# _1815 = PHI <_13(294), 1(2)>
# _905 = PHI <_12(294), 0(2)>
# ivtmp_1876 = PHI <ivtmp_1875(294), 9(2)>
_5 = _1815 * 9;
_6 = _3 + _5;
_7 = sudoku1[_6];
_8 = _7 != 0;
_10 = (sizetype) _905;
_11 = &temp.862 + _10;
*_11 = _8;
_12 = _905 + 1;
_13 = _1815 + 1;
ivtmp_1875 = ivtmp_1876 - 1;
if (ivtmp_1875 == 0)
goto <bb 230>; [11.11%]
else
goto <bb 294>; [88.89%]
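Read together, the two GIMPLE dumps describe the same computation in two shapes: nine scalar iterations that read every ninth int of sudoku1 (a 36 byte stride) and store a logical, versus one vector iteration built from eight strided loads plus a single scalar epilogue iteration. The Python sketch below models that equivalence. It is only an illustration: the function names, the flat-list layout, and the `base` parameter (standing in for `_3`) are assumptions, not taken from the benchmark source.

```python
# Hypothetical stand-in for the loop at exchange2.fppized.f90:1003:32.
# The names, the flat list, and "base" (playing the role of _3) are assumptions.

def scalar_loop(sudoku1, base):
    """After-patch shape: 9 scalar iterations with a stride of 9 elements."""
    temp = [False] * 9
    for k in range(1, 10):                  # _1815 runs from 1 to 9
        temp[k - 1] = sudoku1[base + k * 9] != 0
    return temp


def vectorized_loop(sudoku1, base):
    """Before-patch shape: one 8-lane vector iteration plus 1 epilogue iteration."""
    temp = [False] * 9
    # Eight strided scalar loads, grouped into four 2-element vectors
    # (the vec_construct entries in the cost dump).
    lanes = [sudoku1[base + k * 9] for k in range(1, 9)]
    pairs = [lanes[i:i + 2] for i in range(0, 8, 2)]
    # Compare against zero, select 1 or 0, then pack down to 8 byte-sized lanes.
    packed = [1 if v != 0 else 0 for pair in pairs for v in pair]
    temp[0:8] = [bool(b) for b in packed]   # models the single 8-byte vector store
    # Epilogue: the one remaining scalar iteration.
    temp[8] = sudoku1[base + 9 * 9] != 0
    return temp
```

Because the vector iteration covers eight of the nine elements, the vectorized form has no back edge left, which matches the "loop turned into non-loop" lines in the report.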
So the loop iterates 9 times, and I guess the main reason why vectorizing it is
profitable is that the loop is eliminated entirely.
Since we now cost _8 ? 1 : 0 (4 times) as 64 instead of 16, the vector
inside-of-loop cost goes from 168 to 216 and we decide not to vectorize.
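As a sanity check on the numbers, the body entries in the two cost dumps can be summed by hand. The sketch below does that in Python. Two assumptions are baked in: the after-patch dump is taken to still include the vector_store entry of 12 (it is only listed in the before-patch dump, but the reported total of 216 is consistent with it), and the final comparison is a deliberately simplified "one vector iteration plus outside cost versus nine scalar iterations" check, not the formula GCC uses to compute the minimum profitable iteration count.

```python
# Values copied from the cost dumps; helper names are made up for illustration.
SCALAR_ITER_COST = 28      # "Scalar iteration cost"
VECTOR_OUTSIDE_COST = 64   # "Vector outside cost" (prologue 36 + epilogue 28)
NITERS = 9                 # the loop iterates 9 times


def inside_cost(cond_cost):
    """Sum the 'in body' entries; cond_cost is the cost of _8 ? 1 : 0."""
    entries = [
        8 * 12,     # 8 x scalar_load of sudoku1[_6]
        4 * 4,      # 4 x vec_construct
        16,         # _7 != 0, 4 times vector_stmt
        cond_cost,  # _8 ? 1 : 0, 4 times vector_stmt (16 before, 64 after)
        4 + 8,      # vec_promote_demote entries
        12,         # vector_store (assumed unchanged after the patch)
    ]
    return sum(entries)


def roughly_profitable(cond_cost):
    """Simplified comparison, NOT GCC's actual profitability computation."""
    vector_total = inside_cost(cond_cost) + VECTOR_OUTSIDE_COST
    scalar_total = NITERS * SCALAR_ITER_COST
    return vector_total < scalar_total
```

With the old cost of 16 the sum reproduces the reported inside cost of 168, and with the new cost of 64 it reproduces 216. Under the simplified comparison the vector version wins only in the first case.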