I am revisiting an effort to make the number of lanes for vector segment
load/store a tunable parameter.

A year ago, Robin added minimal and not-yet-tunable
common_vector_cost::segment_permute_[2-8]

But it is tunable, just not a param? :) We have our own cost structure in our downstream repo, adjusted to our uarch. I suggest you do the same or upstream a separate cost structure. I don't think anybody would object to having several of those, one for each uarch (as long as they are sufficiently distinct).

BTW, this is only tangentially related, and I don't know how sensitive your uarch is to scheduling, but given the x264 SAD and other scheduling issues we have seen, you might consider disabling sched1 for your uarch as well. For our uarch we want to keep it on, but we could certainly have another generic-like mtune option that disables it (maybe even a new generic-ooo, with the current generic-ooo renamed to generic-in-order?). I would expect this to become more common in the future anyway.

Some issues & questions:

* Since this pertains only to segment load/store, why is the word "permute"
  in the name?

The vectorizer already performs costing for the segment loads/stores (IIRC as simple loads, though). At some point the idea was to explicitly model the "segment permute/transpose" as a separate operation, i.e.

v0, v1, v2 = segmented_load3x3 (...)
  {
    load vtmp0;
    load vtmp1;
    load vtmp2;
    v0 = {vtmp0[0], vtmp1[0], vtmp2[0]};
    v1 = {vtmp0[1], vtmp1[1], vtmp2[1]};
    v2 = {vtmp0[2], vtmp1[2], vtmp2[2]};
  }

and that permute is the expensive part of the operation in 99% of the cases.
That's where the wording comes from.
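
To make the split concrete, here is a minimal scalar sketch of the same idea in C++. It is not GCC code: segmented_load, NF, VL and Vec are hypothetical names chosen for illustration, and the 3x3 shape just mirrors the pseudocode above. The point is only that the loads are plain contiguous loads and everything else is the permute.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical illustration, not GCC code: a segment load of NF fields,
// decomposed into plain loads followed by a permute/transpose.
constexpr std::size_t NF = 3;  // fields per segment (assumed)
constexpr std::size_t VL = 3;  // elements per vector (assumed, mirrors 3x3)

using Vec = std::array<int, VL>;

std::array<Vec, NF> segmented_load(const int *mem)
{
  assert(mem != nullptr);

  // Step 1: NF contiguous unit-stride loads (the cheap part).
  std::array<Vec, NF> tmp{};
  for (std::size_t r = 0; r < NF; ++r)
    for (std::size_t i = 0; i < VL; ++i)
      tmp[r][i] = mem[r * VL + i];

  // Step 2: the permute (the expensive part). Element i of field f
  // sits at flat memory index i * NF + f.
  std::array<Vec, NF> out{};
  for (std::size_t f = 0; f < NF; ++f)
    for (std::size_t i = 0; i < VL; ++i)
      {
        std::size_t flat = i * NF + f;
        out[f][i] = tmp[flat / VL][flat % VL];
      }
  return out;
}
```

In this model a cost hook would charge step 1 as ordinary loads and step 2 separately, which is what a "segment permute" cost is meant to capture.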

* Nit: why are these defined as individual members rather than an array
  referenced as segment_permute[NF-2]?

No real reason. I guess an array is preferable in several ways so feel free to change that.
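
For what it's worth, an array-based layout could look roughly like the sketch below. This is an assumption about the shape only, not the actual GCC definition: vector_cost_sketch, segment_permute_cost and the cost values are all made up for illustration.

```cpp
#include <cassert>

// Hypothetical sketch, not the actual GCC cost structure.
struct vector_cost_sketch
{
  static constexpr int min_nf = 2;  // smallest segment size modeled
  static constexpr int max_nf = 8;  // largest segment size modeled

  // segment_permute[NF - 2] holds the permute cost for NF fields.
  int segment_permute[max_nf - min_nf + 1];

  // Bounds-checked accessor so callers never index with a raw NF.
  int segment_permute_cost(int nf) const
  {
    assert(nf >= min_nf && nf <= max_nf);
    return segment_permute[nf - min_nf];
  }
};

// Placeholder values for illustration only, not tuned for any uarch.
static const vector_cost_sketch example_costs = {{1, 1, 2, 2, 3, 3, 4}};
```

A caller would then write example_costs.segment_permute_cost(nf) instead of switching over seven separately named members.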

* I implemented tuning as a simple threshold for max NF where segment
  load/store is profitable. Test cases for vector segment store pass, but
  tests for load fail. I found that common_vector_cost::segment_permute is
  properly honored in the store case, but not even inspected in the load
  case. I will need to spelunk the autovec cost model. Clues are welcome.

Could you give an example for that?  Might just be a bug.
Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however, I see that the cost is adjusted for loads.

--
Regards
Robin
