Tamar Christina <tamar.christ...@arm.com> writes:
>> -----Original Message-----
>> From: Richard Sandiford <richard.sandif...@arm.com>
>> Sent: Friday, September 20, 2024 3:48 PM
>> To: Tamar Christina <tamar.christ...@arm.com>
>> Cc: gcc-patches@gcc.gnu.org; nd <n...@arm.com>; Richard Earnshaw
>> <richard.earns...@arm.com>; Marcus Shawcroft
>> <marcus.shawcr...@arm.com>; ktkac...@gcc.gnu.org
>> Subject: Re: [PATCH]AArch64: Take into account when VF is higher than known
>> scalar iters
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>> >>
>> >> So my gut instinct is that we should instead tweak the condition for
>> >> using latency costs, but I'll need to think about it more when I get
>> >> back from holiday.
>> >>
>> >
>> > I think that's a separate problem.. From first principals it should already
>> > be very wrong to compare the scalar loop to an iteration count it will
>> > *NEVER* reach. So I don't understand why that would ever be valid.
>>
>> But I don't think we're doing that, or at least, not as the final result.
>> Instead, we first calculate the minimum number of vector iterations for
>> which the vector loop is sometimes profitable. If this is N, then we're
>> saying that the vector code is better than the scalar code for N*VF
>> iterations. Like you say, this part ignores whether N*VF is actually
>> achievable. But then:
>>
>>       /* Now that we know the minimum number of vector iterations,
>>          find the minimum niters for which the scalar cost is larger:
>>
>>          SIC * niters > VIC * vniters + VOC - SOC
>>
>>          We know that the minimum niters is no more than
>>          vniters * VF + NPEEL, but it might be (and often is) less
>>          than that if a partial vector iteration is cheaper than the
>>          equivalent scalar code.  */
>>       int threshold = (vec_inside_cost * min_vec_niters
>>                        + vec_outside_cost
>>                        - scalar_outside_cost);
>>       if (threshold <= 0)
>>         min_profitable_iters = 1;
>>       else
>>         min_profitable_iters = threshold / scalar_single_iter_cost + 1;
>>
>> calculates which number of iterations in the range [(N-1)*VF + 1, N*VF]
>> is the first to be profitable. This is specifically taking partial
>> iterations into account and includes the N==1 case. The lower niters is,
>> the easier it is for the scalar code to win.
>>
>> This is what is printed as:
>>
>>   Calculated minimum iters for profitability: 7
>>
>> So we think that vectorisation should be rejected if the loop count
>> is <= 6, but accepted if it's >= 7.
>
> This 7 is the vector iteration count.
>
>   epilogue iterations: 0
>   Minimum number of vector iterations: 1
>   Calculated minimum iters for profitability: 7
> /app/example.c:4:21: note: Runtime profitability threshold = 7
> /app/example.c:4:21: note: Static estimate profitability threshold = 7
>
> Which says the vector code has to iterate at least 7 iteration for it to be
> profitable.

It doesn't though:

>   Minimum number of vector iterations: 1

This is in vector iterations but:

>   Calculated minimum iters for profitability: 7

This is in scalar iterations. (Yes, it would be nice if the dump line
was more explicit. :))

This is why, if we change the loop count to 7 rather than 9:

  for (int i = 0; i < 7; i++)

we still get:

/tmp/foo.c:4:21: note: Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note: Runtime profitability threshold = 7
/tmp/foo.c:4:21: note: Static estimate profitability threshold = 7
/tmp/foo.c:4:21: note: ***** Analysis succeeded with vector mode VNx4SI

But if we change it to 6:

  for (int i = 0; i < 6; i++)

we get:

/tmp/foo.c:4:21: note: Cost model analysis:
  Vector inside of loop cost: 20
  Vector prologue cost: 6
  Vector epilogue cost: 0
  Scalar iteration cost: 4
  Scalar outside cost: 0
  Vector outside cost: 6
  prologue iterations: 0
  epilogue iterations: 0
  Minimum number of vector iterations: 1
  Calculated minimum iters for profitability: 7
/tmp/foo.c:4:21: note: Runtime profitability threshold = 7
/tmp/foo.c:4:21: note: Static estimate profitability threshold = 7
/tmp/foo.c:4:21: missed: not vectorized: vectorization not profitable.
/tmp/foo.c:4:21: note: not vectorized: iteration count smaller than user specified loop bound parameter or minimum profitable iterations (whichever is more conservative).
/tmp/foo.c:4:21: missed: Loop costings not worthwhile.
/tmp/foo.c:4:21: note: ***** Analysis failed with vector mode VNx4SI

Thanks,
Richard
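
For reference, the arithmetic behind the 7 in the dumps above can be
reproduced outside the compiler. The following is only a minimal
standalone sketch of the quoted threshold calculation, fed with the
costs from the dump. The function names min_profitable_scalar_iters and
worth_vectorizing are hypothetical and are not GCC functions, and the
real vectorizer applies further checks that are not modelled here.

```c
#include <assert.h>

/* Hypothetical helper, not part of GCC: mirrors the quoted threshold
   calculation with plain ints.  The result is a count of *scalar*
   iterations, even though min_vec_niters counts vector iterations.  */
static int
min_profitable_scalar_iters (int vec_inside_cost, int min_vec_niters,
                             int vec_outside_cost, int scalar_outside_cost,
                             int scalar_single_iter_cost)
{
  /* Guard the division below; a zero or negative per-iteration scalar
     cost is assumed not to occur in this sketch.  */
  assert (scalar_single_iter_cost > 0);

  /* Right-hand side of  SIC * niters > VIC * vniters + VOC - SOC.  */
  int threshold = (vec_inside_cost * min_vec_niters
                   + vec_outside_cost
                   - scalar_outside_cost);
  if (threshold <= 0)
    return 1;

  /* Smallest niters for which SIC * niters exceeds the threshold.  */
  return threshold / scalar_single_iter_cost + 1;
}

/* Hypothetical helper: the accept/reject decision described above for
   a loop whose scalar iteration count is known at compile time.  */
static int
worth_vectorizing (int known_niters, int min_profitable_iters)
{
  return known_niters >= min_profitable_iters;
}
```

With the dump values (inside cost 20, one vector iteration, vector
outside cost 6, scalar outside cost 0, scalar iteration cost 4) the
threshold is 20 * 1 + 6 - 0 = 26, and 26 / 4 + 1 = 7 with integer
division. Seven scalar iterations cost 28, which exceeds 26, while six
cost 24, which does not. That matches the loop with 7 iterations being
accepted and the loop with 6 being rejected.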