Hi Bill,

Thanks for the review!

on 2021/9/18 12:34 AM, Bill Schmidt wrote:
> Hi Kewen,
>
> On 9/15/21 8:14 PM, Kewen.Lin wrote:
>> Hi,
>>
>> This patch follows the discussion here[1], where Segher pointed
>> out the existing way to guard the extra penalized cost for
>> strided/elementwise loads with a magic bound doesn't scale.
>>
>> The way with nunits * stmt_cost can get one much exaggerated
>> penalized cost, such as: for V16QI on P8, it's 16 * 20 = 320,
>> that's why we need one bound.  To make it scale, this patch
>> doesn't use nunits * stmt_cost any more, but it still keeps
>> nunits since there are actually nunits scalar loads there.  So
>> it uses one cost adjusted from stmt_cost, since the current
>> stmt_cost sort of considers nunits, we can stabilize the cost
>> for big nunits and retain the cost for small nunits.  After
>> some tries, this patch gets the adjusted cost as:
>>
>>     stmt_cost / (log2(nunits) * log2(nunits))
>>
>> For V16QI, the adjusted cost would be 1 and total penalized
>> cost is 16, it isn't exaggerated.  For V2DI, the adjusted
>> cost would be 2 and total penalized cost is 4, which is the
>> same as before.  btw, I tried to use one single log2(nunits),
>> but the penalized cost is still big enough and can't fix the
>> degraded bmk blender_r.
>>
>> The separated SPEC2017 evaluations on Power8, Power9 and Power10
>> at option sets O2-vect and Ofast-unroll showed this change is
>> neutral (that is same effect as before).
>>
>> Bootstrapped and regress-tested on powerpc64le-linux-gnu Power9.
>>
>> Is it ok for trunk?
>>
>> [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579121.html
>>
>> BR,
>> Kewen
>> -----
>> gcc/ChangeLog:
>>
>>     * config/rs6000/rs6000.c (rs6000_update_target_cost_per_stmt): Adjust
>>     the way to compute extra penalized cost.
>>
>> ---
>>  gcc/config/rs6000/rs6000.c | 28 +++++++++++++++++-----------
>>  1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/gcc/config/rs6000/rs6000.c b/gcc/config/rs6000/rs6000.c
>> index 4ab23b0ab33..e08b94c0447 100644
>> --- a/gcc/config/rs6000/rs6000.c
>> +++ b/gcc/config/rs6000/rs6000.c
>> @@ -5454,17 +5454,23 @@ rs6000_update_target_cost_per_stmt (rs6000_cost_data
>> *data,
>>          {
>>            tree vectype = STMT_VINFO_VECTYPE (stmt_info);
>>            unsigned int nunits = vect_nunits_for_cost (vectype);
>> -          unsigned int extra_cost = nunits * stmt_cost;
>> -          /* As function rs6000_builtin_vectorization_cost shows, we have
>> -             priced much on V16QI/V8HI vector construction as their units,
>> -             if we penalize them with nunits * stmt_cost, it can result in
>> -             an unreliable body cost, eg: for V16QI on Power8, stmt_cost
>> -             is 20 and nunits is 16, the extra cost is 320 which looks
>> -             much exaggerated.  So let's use one maximum bound for the
>> -             extra penalized cost for vector construction here.  */
>> -          const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
>> -          if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
>> -            extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
>> +          /* As function rs6000_builtin_vectorization_cost shows, we
>> +             have priced much on V16QI/V8HI vector construction by
>> +             considering their units, if we penalize them with nunits
>> +             * stmt_cost here, it can result in an unreliable body cost,
>
> This might be confusing to the reader, since you have deleted the calculation
> of nunits * stmt_cost.  Could you instead write this to indicate that we used
> to adjust in this way, and it had this particular downside, so that's why
> you're choosing this heuristic?  It's a minor thing but I think people reading
> the code will be confused otherwise.
>

Good point!  I'll update the commentary to explain it, thanks!!

BR,
Kewen

> I think the heuristic is generally reasonable, and certainly better than what
> we had before!
>
> LGTM with adjusted commentary, so recommend maintainers approve.
>
> Thanks for the patch!
> Bill
>> +             eg: for V16QI on Power8, stmt_cost is 20 and nunits is 16,
>> +             the penalty will be 320 which looks much exaggerated.  But
>> +             there are actually nunits scalar loads, so we try to adopt
>> +             one reasonable penalized cost for each load rather than
>> +             stmt_cost.  Here, with stmt_cost dividing by log2(nunits)^2,
>> +             we can still retain the necessary penalty for small nunits
>> +             meanwhile stabilize the penalty for big nunits.  */
>> +          int nunits_log2 = exact_log2 (nunits);
>> +          gcc_assert (nunits_log2 > 0);
>> +          unsigned int nunits_sq = nunits_log2 * nunits_log2;
>> +          unsigned int adjusted_cost = stmt_cost / nunits_sq;
>> +          gcc_assert (adjusted_cost > 0);
>> +          unsigned int extra_cost = nunits * adjusted_cost;
>>            data->extra_ctor_cost += extra_cost;
>>          }
>>      }
>> --
>> 2.25.1
>
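
For reference, here is a minimal standalone sketch of the arithmetic discussed
in the quoted patch, so the old capped penalty and the new log2-based penalty
can be compared outside of GCC.  It is only an illustration, not the rs6000.c
code: the names exact_log2_sketch, old_penalty and new_penalty are hypothetical,
exact_log2_sketch is a simplified stand-in for GCC's exact_log2, and the V2DI
stmt_cost of 2 used in the note below is inferred from the figures in the patch
description rather than read from the cost tables.

```c
#include <assert.h>

/* Hypothetical stand-in for GCC's exact_log2: returns log2 (x) when x
   is a power of two, otherwise -1.  */
static int
exact_log2_sketch (unsigned int x)
{
  if (x == 0 || (x & (x - 1)) != 0)
    return -1;
  int n = 0;
  while (x >>= 1)
    n++;
  return n;
}

/* Old scheme from the removed lines: nunits * stmt_cost, clamped to the
   magic bound MAX_PENALIZED_COST_FOR_CTOR = 12.  */
static unsigned int
old_penalty (unsigned int nunits, unsigned int stmt_cost)
{
  const unsigned int max_penalized_cost_for_ctor = 12;
  unsigned int extra_cost = nunits * stmt_cost;
  if (extra_cost > max_penalized_cost_for_ctor)
    extra_cost = max_penalized_cost_for_ctor;
  return extra_cost;
}

/* New scheme from the added lines: divide stmt_cost by log2 (nunits)^2
   (integer division), then multiply by nunits.  Plain assert replaces
   gcc_assert here.  */
static unsigned int
new_penalty (unsigned int nunits, unsigned int stmt_cost)
{
  int nunits_log2 = exact_log2_sketch (nunits);
  assert (nunits_log2 > 0);
  unsigned int nunits_sq = nunits_log2 * nunits_log2;
  unsigned int adjusted_cost = stmt_cost / nunits_sq;
  assert (adjusted_cost > 0);
  return nunits * adjusted_cost;
}
```

With these definitions, new_penalty (16, 20) gives 16 for V16QI on Power8
(the old scheme would clamp 320 down to 12), and new_penalty (2, 2) gives 4
for V2DI, the same as old_penalty (2, 2).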