I am revisiting an effort to make the number of lanes for vector segment
load/store a tunable parameter.
A year ago, Robin added minimal and not-yet-tunable
common_vector_cost::segment_permute_[2-8].
But it is tunable, just not a param? :) We have our own cost structure in our
downstream repo, adjusted to our uarch. I suggest you do the same or upstream
a separate cost structure. I don't think anybody would object to having
several of those, one for each uarch (as long as they are sufficiently
distinct).
BTW, this is only tangentially related and I don't know how sensitive your
uarch is to scheduling, but given the x264 SAD and other sched issues we have
seen, you might consider disabling sched1 as well for your uarch? I know that
for our uarch we want to keep it on, but we surely could have another
generic-like mtune option that disables it (maybe even generic-ooo and change
the current generic-ooo to generic-in-order?). I would expect this to get more
common in the future anyway.
Some issues & questions:
* Since this pertains only to segment load/store, why is the word "permute"
in the name?
The vectorizer already performs costing for the segment loads/stores (IIRC as
simple loads, though). At some point the idea was to explicitly model the
"segment permute/transpose" as a separate operation, i.e.
v0, v1, v2 = segmented_load3x3 (...)
{
load vtmp0;
load vtmp1;
load vtmp2;
v0 = {vtmp0[0], vtmp1[0], vtmp2[0]};
v1 = {vtmp0[1], vtmp1[1], vtmp2[1]};
v2 = {vtmp0[2], vtmp1[2], vtmp2[2]};
}
and that permute is the expensive part of the operation in 99% of the cases.
That's where the wording comes from.
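
To make the split concrete, here is a minimal scalar sketch of the same idea in
C++. It is not GCC code: the function name segmented_load3 and the use of
std::vector as a stand-in for vector registers are assumptions made purely for
illustration. It models a 3-field segment load as three plain loads followed by
the permute step.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scalar model of a 3-field segment load; not GCC code.
// Memory holds interleaved records: a0 b0 c0 a1 b1 c1 ...
constexpr std::size_t NF = 3;

std::array<std::vector<int>, NF>
segmented_load3 (const std::vector<int> &mem, std::size_t vl)
{
  assert (mem.size () >= NF * vl);

  // Step 1: NF plain unit-stride loads of vl elements each.
  std::array<std::vector<int>, NF> vtmp;
  for (std::size_t i = 0; i < NF; ++i)
    vtmp[i].assign (mem.begin () + i * vl, mem.begin () + (i + 1) * vl);

  // Step 2: the permute.  Element j of field f lives at flat index
  // j * NF + f, which falls into one of the temporaries loaded above.
  std::array<std::vector<int>, NF> v;
  for (std::size_t f = 0; f < NF; ++f)
    for (std::size_t j = 0; j < vl; ++j)
      {
        std::size_t flat = j * NF + f;
        v[f].push_back (vtmp[flat / vl][flat % vl]);
      }
  return v;
}
```

With vl = 3 this reproduces the element selection in the pseudocode above;
step 2 is the part the segment_permute costs are meant to account for.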
* Nit: why are these defined as individual members rather than an array
referenced as segment_permute[NF-2]?
No real reason. I guess an array is preferable in several ways so feel free to
change that.
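
As a rough sketch of what the array form could look like: the struct and
accessor names below (segment_costs, permute_cost) are hypothetical and only
illustrate the NF-2 indexing, they are not the actual definition in the RISC-V
backend.

```cpp
#include <array>
#include <cassert>

// Hypothetical stand-in for the per-NF members; not the actual GCC
// definition.
struct segment_costs
{
  // Index 0 holds the cost for NF = 2, index 6 the cost for NF = 8.
  std::array<int, 7> segment_permute;

  // Look up the permute cost for a segment access with nf fields.
  int permute_cost (int nf) const
  {
    assert (nf >= 2 && nf <= 8);
    return segment_permute[nf - 2];
  }
};
```

An accessor like this keeps the NF-2 offset in one place instead of repeating
it at every use.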
* I implemented tuning as a simple threshold for max NF where segment
load/store is profitable. Test cases for vector segment store pass, but
tests for load fail. I found that common_vector_cost::segment_permute is
properly honored in the store case, but not even inspected in the load
case. I will need to spelunk the autovec cost model. Clues are welcome.
Could you give an example for that? Might just be a bug.
Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however, I see
that the cost is adjusted for loads.
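
For reference, the simple max-NF threshold described in the question could be
sketched roughly as below. All names here (segment_tuning, max_profitable_nf,
segment_access_profitable) are hypothetical and do not correspond to existing
GCC params or hooks.

```cpp
#include <cassert>

// Hypothetical tuning knob; the name is illustrative only.
struct segment_tuning
{
  // Largest NF for which a segment load/store is still considered
  // profitable on this uarch.
  int max_profitable_nf;
};

// Returns true if a segment access with nf fields is considered
// profitable under the simple-threshold model.
bool
segment_access_profitable (const segment_tuning &t, int nf)
{
  assert (nf >= 2 && nf <= 8);
  return nf <= t.max_profitable_nf;
}
```

For the tuning to take effect, the same check would need to be reached on both
the load and the store costing paths.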
--
Regards
Robin