Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
>> To fix it, is it necessary to support 'vec_unpack' ?
> 
> both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer
> ties its hands by choosing vector types early and based on the number
> of incoming/outgoing vectors it chooses one or the other method.
> 
> More precise dumping would probably help here but somewhere earlier you
> should be able to see the vector type used for _2
We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and
then switch to VNx4QI (i.e. a mode that only determines the number of
units/elements) and have vectorize_related_mode return modes with the
same number of units.  This will then result in the sext/zext patterns
matching.  The first round where we try the normal mode will not match
those because the related mode has a different number of units.

So it's somewhat expected that the first try fails.

My dump shows that we vectorize, so IMHO no problem.  I can take a look
at this but it doesn't look like a case for pack/unpack.  

Regards
 Robin


Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
> it's target dependent what we choose first so it's going to be
> a bit difficult to adjust testcases like this (and it looks like
> a testsuite issue).  I think for this specific testcase changing
> scan-tree-dump-times to scan-tree-dump is reasonable.  Note we
> really want to check that for the case we choose finally
> we use the sdot pattern, but I don't see how we can easily constrain
> the dump-scans.  Can we do sth like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?
> 
> The other way would be to somehow add a dump flag that produces
> dumps only for the succeeded part.  Of course we have targets that
> evaluate multiple succeeded parts for costing (but luckily costing
> is disabled for most tests).

I'm going to have a try at fixing the test expectations but not before
tomorrow.  Right now I can see
 "vect_recog_widen_mult_pattern: detected"
even four times in the dump, 2x for each try, so I first need to
understand what's going on.

Regards
 Robin


Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc


>> I am wondering whether we do have some situations that
>> vec_pack/vec_unpack/vec_widen_xxx/dot_prod pattern can be
>> beneficial for RVV ? I have ever met some situation that vec_unpack
>> can be beneficial when working on SELECT_VL but I don't which
>> case
> 
> With fixed size vectors you'll face the situation that the vectorizer
> chooses the "wrong" vector type so yes, I think implementing
> vec_unpack[s]_{lo,hi} might be useful.  But I wouldn't prioritize this
> until you have a more clear picture of how useful it would be.

Another thing that comes to mind is that we currently don't do
vectorizable calls with mismatched vector sizes.  So even if we detected
e.g. vec_widen_plus early it wouldn't get us much further.
On the other hand, I don't think we perform many optimizations on such
patterns between vect and combine (where we finally generate those).

Regards
 Robin



Re: Question about wrapv-vect-reduc-dot-s8b.c

2023-08-30 Thread Robin Dapp via Gcc
> the dump-scans.  Can we do sth like
> "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus
> after the dot-prod pattern dumping allow arbitrary stuff but _not_
> a "failed" and then require a "succeeded"?

It took some fighting with tcl syntax until I arrived at the regex
pattern below but it looks like it is possible.

 "vect_recog_dot_prod_pattern: detected(?:(?!failed).)*succeeded".

This seems to work for the failing cases and I'm going to send a patch
tomorrow if an x86 testsuite run is unchanged.

Regards
 Robin


Re: gcc relies on RISC-V vcompress instruction undefined behaviour

2024-10-31 Thread Robin Dapp via Gcc
> Hi,
>
> I think gcc is relying on undefined behaviour with the vcompress instruction.
> Unfortunately my test case isn't reproducing on mainline, but gcc looks to
> use the fields between the last mask selected field and vl while setting
> tail agnostic.
>
> This thread explains how vcompress is different in that the tail starts after
> the last mask selected field:
>
> https://github.com/riscvarchive/riscv-v-spec/issues/796
>
> There was a bug in QEMU that I just fixed that prevented the all 1s tail
> agnostic option (rvv_ta_all_1s) from poisoning these bits:
>
> https://lists.nongnu.org/archive/html/qemu-riscv/2024-10/msg00561.html
>
> With that fix, I see problems until I modify the previous setvli from ta to
> tu. I think 9aabf81f40f0 ("RISC-V: Optimize permutation codegen with
> compress") is one place we need to set tail undisturbed, but my fail is
> from an earlier checkout so there must be another issue in the tree.
>
> I presume the fix is to force all these vcompress cases to set tail
> undisturbed.
>
> Thanks,
> Anton


Could you send your test case so we can have a look?  Or directly file a bug
as Richi suggested.

-- 
Regards
 Robin



Re: [RISC-V] vector segment load/store width as a riscv_tune_param

2025-03-25 Thread Robin Dapp via Gcc

I am revisiting an effort to make the number of lanes for vector segment
load/store a tunable parameter.

A year ago, Robin added minimal and not-yet-tunable
common_vector_cost::segment_permute_[2-8]


But it is tunable, just not a param? :)  We have our own cost structure in our 
downstream repo, adjusted to our uarch.  I suggest you do the same or upstream 
a separate cost structure.  I don't think anybody would object to having 
several of those, one for each uarch (as long as they are sufficiently 
distinct).


BTW, just tangentially related and I don't know how sensitive your uarch is to 
scheduling, but with the x264 SAD and other sched issues we have seen you might 
consider disabling sched1 as well for your uarch?  I know that for our uarch we 
want to keep it on but we surely could have another generic-like mtune option 
that disables it (maybe even generic-ooo and change the current generic-ooo to 
generic-in-order?).  I would expect this to get more common in the future 
anyway.



Some issues & questions:

* Since this pertains only to segment load/store, why is the word "permute"
  in the name?


The vectorizer already performs costing for the segment loads/stores (IIRC as 
simple loads, though).  At some point the idea was to explicitly model the 
"segment permute/transpose" as a separate operation i.e.


v0, v1, v2 = segmented_load3x3 (...)
  {
load vtmp0;
load vtmp1;
load vtmp2;
v0 = {vtmp0[0], v1tmp[0], v2tmp[0]};
v1 = {vtmp0[1], v1tmp[1], v2tmp[1]};
v2 = {vtmp0[2], v1tmp[2], v2tmp[2]};
  }

and that permute is the expensive part of the operation in 99% of the cases.
That's where the wording comes from.


* Nit: why are these defined as individual members rather than an array
  referenced as segment_permute[NF-2]?


No real reason.  I guess an array is preferable in several ways so feel free to 
change that.



* I implemented tuning as a simple threshold for max NF where segment
  load/store is profitable. Test cases for vector segment store pass, but
  tests for load fail. I found that common_cost_vector::segment_permute is
  properly honored in the store case, but not even inspected in the load
  case. I will need to spelunk the autovec cost model. Clues are welcome.


Could you give an example for that?  Might just be a bug.
Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however I see 
that the cost is adjusted for loads, though.


--
Regards
Robin



Re: [RISC-V] vector segment load/store width as a riscv_tune_param

2025-03-26 Thread Robin Dapp via Gcc

You won't see failures in the testsuite. The failures only show-up when I
attempt to impose huge costs on NF above threshold. A quick & dirty way to
expose the bug is apply the appended patch, then observe that you get output
from this only for mask_struct_store-*.c and not for mask_struct_load-*.c


I suppose that's due to Richi's restructuring of the vector/SLP code.  What 
might work is (untested):


diff --git a/gcc/config/riscv/riscv-vector-costs.cc 
b/gcc/config/riscv/riscv-vector-costs.cc
index 167375ca751..3c8ff760a4e 100644
--- a/gcc/config/riscv/riscv-vector-costs.cc
+++ b/gcc/config/riscv/riscv-vector-costs.cc
@@ -1114,7 +1114,7 @@ segment_loadstore_group_size (enum vect_cost_for_stmt 
kind,
unsigned
costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop,
stmt_vec_info stmt_info,
-slp_tree, tree vectype, int stmt_cost)
+slp_tree node, tree vectype, int stmt_cost)
{
  const cpu_vector_cost *costs = get_vector_costs ();
  switch (kind)
@@ -1138,6 +1138,8 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, 
loop_vec_info loop,
costs for each.  */
 /* TODO: Indexed and ordered/unordered cost.  */
 int group_size = segment_loadstore_group_size (kind, stmt_info);
+ if (!group_size && node->ldst_lanes)
+   group_size = node->lanes;
 if (group_size > 1)
   {
  switch (group_size)


--
Regards
Robin