On 9/9/21 11:11 AM, Segher Boessenkool wrote:
Hi!

On Wed, Sep 08, 2021 at 02:57:14PM +0800, Kewen.Lin wrote:
+      /* If we have strided or elementwise loads into a vector, it's
+        possible to be bounded by latency and execution resources for
+        many scalar loads.  Try to account for this by scaling the
+        construction cost by the number of elements involved, when
+        handling each matching statement we record the possible extra
+        penalized cost into target cost, in the end of costing for
+        the whole loop, we do the actual penalization once some load
+        density heuristics are satisfied.  */
The above comment is quite hard to read.  Can you please break up the last
sentence into at least two sentences?
How about the below:

+      /* If we have strided or elementwise loads into a vector, it's
"strided" is not a word: it properly is "stridden", which does not read
very well either.  "Have loads by stride, or by element, ..."?  Is that
good English, and easier to understand?

No, this is OK.  "Strided loads" is a term of art used by the vectorizer; whether or not it was the Queen's English, it's what we have...  (And I think you might only find "bestridden" in some 18th or 19th century English poetry... :-)

+        possible to be bounded by latency and execution resources for
+        many scalar loads.  Try to account for this by scaling the
+        construction cost by the number of elements involved.  For
+        each matching statement, we record the possible extra
+        penalized cost into the relevant field in target cost.  When
+        we want to finalize the whole loop costing, we will check if
+        those related load density heuristics are satisfied, and add
+        this accumulated penalized cost if yes.  */

Otherwise this looks good to me, and I recommend maintainers approve with
that clarified.
Does that text look good to you now, Bill?  It is still kinda complex;
maybe you see a way to make it simpler.

I think it's OK now.  The complexity at least matches the code instead of exceeding it. :-P  j/k...


        * config/rs6000/rs6000.c (struct rs6000_cost_data): New members
        nstmts, nloads and extra_ctor_cost.
        (rs6000_density_test): Add load density related heuristics and the
        checks, do extra costing on vector construction statements if needed.
"and the checks"?  Oh, "and checks"?  It is probably fine to just leave
out this whole phrase part :-)

Don't use commas like this in changelogs.  s/, do/.  Do/  Yes this is a
bit boring text that way, but that is the purpose: it makes it simpler
to read (and read quickly, even merely scan).

@@ -5262,6 +5262,12 @@ typedef struct _rs6000_cost_data
[ Btw, you can get rid of the typedef now, just have a struct with the
non-underscore name, we have C++ now.  Such a mechanical change (as
separate patch!) is pre-approved. ]

+  /* Check if we need to penalize the body cost for latency and
+     execution resources bound from strided or elementwise loads
+     into a vector.  */
Bill, is that clear enough?  I'm sure something nicer would help here,
but it's hard for me to write anything :-)

Perhaps:  "Check whether we need to penalize the body cost to account for excess strided or elementwise loads."

+  if (data->extra_ctor_cost > 0)
+    {
+      /* Threshold for load stmts percentage in all vectorized stmts.  */
+      const int DENSITY_LOAD_PCT_THRESHOLD = 45;
Threshold for what?

45% is awfully exact.  Can you make this a param?

+      /* Threshold for total number of load stmts.  */
+      const int DENSITY_LOAD_NUM_THRESHOLD = 20;
Same.


We have similar magic constants in here already.  Parameterizing is possible, but I'm more interested in making sure the numbers are appropriate for each processor.  Given that Kewen reports they work well for both P9 and P10, I'm pretty happy with what we have here.  (Kewen, thanks for running the P10 experiments!)

Perhaps a follow-up patch to add params for the magic constants would be reasonable, but I'd personally consider it pretty low priority.
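If someone does pick up that follow-up, target --param entries in an .opt file might look roughly like this.  The names, defaults, and placement are entirely hypothetical; check the current option-file conventions before copying:

```
-param=rs6000-density-load-pct-threshold=
Target Joined UInteger Var(rs6000_density_load_pct_threshold) Init(45) IntegerRange(0, 100) Param
Threshold (as a percentage of all vectorized stmts) above which load density is considered high.

-param=rs6000-density-load-num-threshold=
Target Joined UInteger Var(rs6000_density_load_num_threshold) Init(20) Param
Threshold for the total number of vectorized load stmts.
```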


+      unsigned int load_pct = (data->nloads * 100) / (data->nstmts);
No parens around the last thing please.  The other pair of parens is
unneeded as well, but perhaps it is easier to read like that.

+         if (dump_enabled_p ())
+           dump_printf_loc (MSG_NOTE, vect_location,
+                            "Found %u loads and load pct. %u%% exceed "
+                            "the threshold, penalizing loop body "
+                            "cost by extra cost %u for ctor.\n",
+                            data->nloads, load_pct, data->extra_ctor_cost);
That line does not fit.  Can you split it across more lines?

It is a pity that using these interfaces at all takes up 45 chars
of noise already.

+/* Helper function for add_stmt_cost.  Check each statement cost
+   entry, gather information and update the target_cost fields
+   accordingly.  */
+static void
+rs6000_update_target_cost_per_stmt (rs6000_cost_data *data,
+                                   enum vect_cost_for_stmt kind,
+                                   struct _stmt_vec_info *stmt_info,
+                                   enum vect_cost_model_location where,
+                                   int stmt_cost, unsigned int orig_count)
Please put those last two on separate lines as well?

+         /* As function rs6000_builtin_vectorization_cost shows, we have
+            priced much on V16QI/V8HI vector construction as their units,
+            if we penalize them with nunits * stmt_cost, it can result in
+            an unreliable body cost, eg: for V16QI on Power8, stmt_cost
+            is 20 and nunits is 16, the extra cost is 320 which looks
+            much exaggerated.  So let's use one maximum bound for the
+            extra penalized cost for vector construction here.  */
+         const unsigned int MAX_PENALIZED_COST_FOR_CTOR = 12;
+         if (extra_cost > MAX_PENALIZED_COST_FOR_CTOR)
+           extra_cost = MAX_PENALIZED_COST_FOR_CTOR;
That is a pretty gross hack.  Can you think of any saner way to not have
those out of scale costs in the first place?

In Kewen's defense, the whole business of "finish_cost" for these vectorized loops is to tweak things that don't work quite right with the hooks the vectorizer currently provides, which add costs on a per-stmt basis without looking at the overall set of statements.  It gives the back end a chance to massage things and exercise veto power over otherwise bad decisions.  By nature, that's going to be very much a heuristic exercise.

Personally I think the heuristics used here are pretty reasonable, and importantly they are designed to be employed only in pretty rare circumstances.  It doesn't look easy to me to avoid the need for a cap here without making the rest of the heuristics harder to understand.  But sure, he can try! :)

Kewen, thanks for the updates!

Bill


Okay for trunk with such tweaks.  Thanks!  (And please consult with Bill
for the wordsmithing :-) )


Segher
