On Tue, Mar 25, 2025 at 2:47 AM Robin Dapp wrote:
> > A year ago, Robin added minimal and not-yet-tunable
> > common_vector_cost::segment_permute_[2-8]
>
> But it is tunable, just not a param? :)
I meant "param" generically, not necessarily a command-line --param=thingy,
though point taken! :)
> We have our own cost structure in our downstream repo, adjusted to our
> uarch. I suggest you do the same or upstream a separate cost structure.
> I don't think anybody would object to having several of those, one for
> each uarch (as long as they are sufficiently distinct).
>
Yes, this is what I meant by "not-yet-tunable": there is currently no path
between -mcpu/-mtune and common_vector_cost::segment_permute_*. All CPUs get
the same hard-coded value of 1 for every segment_permute_* cost.
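A minimal C++ sketch of what such a path could look like, assuming one cost
table per uarch selected by the -mtune name. The struct, the table, the tune
name "example-uarch" and both functions are hypothetical illustrations, not
GCC's actual riscv_tune_param or common_vector_cost definitions. The generic
entry reproduces the current behaviour of a flat cost of 1 for every NF.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical, simplified stand-in for a per-uarch segment cost table.
// It mirrors the idea of common_vector_cost::segment_permute_[2-8] but is
// NOT GCC's real definition.
struct segment_costs
{
  int segment_permute[7]; // index 0 -> NF=2, ..., index 6 -> NF=8
};

// Hypothetical mapping from an -mtune= name to its cost table.
struct tune_entry
{
  const char *name;
  const segment_costs *costs;
};

// Current behaviour: every CPU gets 1 for every NF.
static const segment_costs generic_segment_costs = {{1, 1, 1, 1, 1, 1, 1}};
// Hypothetical uarch on which NF > 4 is slow.
static const segment_costs example_uarch_segment_costs = {{1, 1, 1, 8, 8, 8, 8}};

static const tune_entry tune_table[] = {
  {"generic", &generic_segment_costs},
  {"example-uarch", &example_uarch_segment_costs},
};

// Return the cost table for TUNE, falling back to the generic one.
const segment_costs *
lookup_segment_costs (const char *tune)
{
  for (const tune_entry &e : tune_table)
    if (std::strcmp (e.name, tune) == 0)
      return e.costs;
  return &generic_segment_costs;
}

// Cost of a segment load/store with NF fields.
int
segment_permute_cost (const segment_costs *c, int nf)
{
  assert (nf >= 2 && nf <= 8);
  return c->segment_permute[nf - 2];
}
```

Keeping a pointer to a whole table per tune entry, rather than a single
number, leaves room for per-NF costs that are not a simple step function.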
> BTW, just tangentially related and I don't know how sensitive your uarch
> is to scheduling, but with the x264 SAD and other sched issues we have
> seen you might consider disabling sched1 as well for your uarch? I know
> that for our uarch we want to keep it on, but we surely could have another
> generic-like mtune option that disables it (maybe even generic-ooo and
> change the current generic-ooo to generic-in-order?). I would expect this
> to get more common in the future anyway.
Thanks for the tip. We will look into it.
> > Some issues & questions:
> >
> > * Since this pertains only to segment load/store, why is the word
> > "permute" in the name?
>
> The vectorizer already performs costing for the segment loads/stores
> (IIRC as simple loads, though). At some point the idea was to explicitly
> model the "segment permute/transpose" as a separate operation i.e.
>
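To make the "permute/transpose" view concrete, here is a scalar C++ model of
the semantics of an RVV segment load such as vlseg3e32.v, plus the matching
segment store. It is an illustrative sketch, not vectorizer code, and the
function names are hypothetical. Loading VL records of NF fields into NF
registers amounts to transposing a VL x NF matrix, which is presumably the
reasoning behind "permute" in the cost names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Scalar model of an NF-field segment load: MEM holds VL records of NF
// fields each (array of structures); the result is NF "registers", where
// register f holds field f of every record. This is a VL x NF transpose.
std::vector<std::vector<int>>
segment_load_model (const std::vector<int> &mem, std::size_t nf, std::size_t vl)
{
  assert (mem.size () >= nf * vl);
  std::vector<std::vector<int>> regs (nf, std::vector<int> (vl));
  for (std::size_t i = 0; i < vl; ++i)     // record (lane) index
    for (std::size_t f = 0; f < nf; ++f)   // field index
      regs[f][i] = mem[i * nf + f];
  return regs;
}

// Scalar model of the matching segment store: interleave the NF registers
// back into array-of-structures order.
std::vector<int>
segment_store_model (const std::vector<std::vector<int>> &regs)
{
  std::size_t nf = regs.size ();
  std::size_t vl = nf ? regs[0].size () : 0;
  std::vector<int> mem (nf * vl);
  for (std::size_t i = 0; i < vl; ++i)
    for (std::size_t f = 0; f < nf; ++f)
      mem[i * nf + f] = regs[f][i];
  return mem;
}
```

For example, with NF=3 and VL=4, memory {0, 1, ..., 11} loads as register 0
= {0, 3, 6, 9}, register 1 = {1, 4, 7, 10}, register 2 = {2, 5, 8, 11}, and
storing those registers reproduces the original memory.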
This is a different concept, so I ought to introduce a new cost param: the
threshold value of NF that separates fast from slow segment accesses.
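A minimal sketch of the proposed threshold, assuming a hypothetical
max_fast_nf tuning field (no such field exists in GCC today). Segment accesses
with NF at or below the threshold get the normal cost; anything above it gets
a deliberately large cost so the vectorizer prefers another strategy.

```cpp
#include <cassert>

// Hypothetical per-uarch tuning knobs; not an existing GCC structure.
struct segment_tune
{
  int max_fast_nf; // e.g. 4: NF 2..4 fast, NF 5..8 slow
  int fast_cost;   // cost when NF <= max_fast_nf
  int slow_cost;   // large cost that steers the vectorizer away
};

// Cost of a segment load/store whose group size is NF (2..8).
int
segment_access_cost (const segment_tune &t, int nf)
{
  assert (nf >= 2 && nf <= 8);
  return nf <= t.max_fast_nf ? t.fast_cost : t.slow_cost;
}
```

A single threshold is coarser than separate per-NF costs, but it matches a
uarch whose segment load/store throughput falls off a cliff past some NF. It
could equally be implemented by filling the existing per-NF entries from the
threshold.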
> > * I implemented tuning as a simple threshold for max NF where segment
> > load/store is profitable. Test cases for vector segment store pass, but
> > tests for load fail. I found that common_vector_cost::segment_permute_*
> > is properly honored in the store case, but not even inspected in the
> > load case. I will need to spelunk the autovec cost model. Clues are
> > welcome.
>
> Could you give an example for that? Might just be a bug.
> Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however,
> I see that the cost is adjusted for loads, though.
You won't see failures in the testsuite. The failures only show up when I
attempt to impose huge costs on NF above the threshold. A quick & dirty way
to expose the bug is to apply the appended patch, then observe that you get
output from it only for mask_struct_store-*.c and not for
mask_struct_load-*.c.
G
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1140,6 +1140,7 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
int group_size = segment_loadstore_group_size (kind, stmt_info);
if (group_size > 1)
{
+ fprintf (stderr, "segment_loadstore_group_size = %d\n", group_size);
switch (group_size)
switch (group_size)
{
case 2: