Re: Question about wrapv-vect-reduc-dot-s8b.c
>> To fix it, is it necessary to support 'vec_unpack' ? > > both same units would be sext, not vec_unpacks_{lo,hi} - the vectorizer > ties its hands by choosing vector types early and based on the number > of incoming/outgoing vectors it chooses one or the other method. > > More precise dumping would probably help here but somewhere earlier you > should be able to see the vector type used for _2 We usually try with a "normal" mode like VNx4SI (RVVM1SI or so) and then switch to VNx4QI (i.e. a mode that only determines the number of units/elements) and have vectorize_related_mode return modes with the same number of units. This will then result in the sext/zext patterns matching. The first round where we try the normal mode will not match those because the related mode has a different number of units. So it's somewhat expected that the first try fails. My dump shows that we vectorize, so IMHO no problem. I can take a look at this but it doesn't look like a case for pack/unpack. Regards Robin
Re: Question about wrapv-vect-reduc-dot-s8b.c
> it's target dependent what we choose first so it's going to be > a bit difficult to adjust testcases like this (and it looks like > a testsuite issue). I think for this specific testcase changing > scan-tree-dump-times to scan-tree-dump is reasonable. Note we > really want to check that for the case we choose finally > we use the sdot pattern, but I don't see how we can easily constrain > the dump-scans. Can we do sth like > "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus > after the dot-prod pattern dumping allow arbitrary stuff but _not_ > a "failed" and then require a "succeeded"? > > The other way would be to somehow add a dump flag that produces > dumps only for the succeeded part. Of course we have targets that > evaluate multiple succeeded parts for costing (but luckily costing > is disabled for most tests). I'm going to have a try at fixing the test expectations but not before tomorrow. Right now I can see "vect_recog_widen_mult_pattern: detected" even four times in the dump, 2x for each try, so I first need to understand what's going on. Regards Robin
Re: Question about wrapv-vect-reduc-dot-s8b.c
>> I am wondering whether we do have some situations that >> vec_pack/vec_unpack/vec_widen_xxx/dot_prod pattern can be >> beneficial for RVV ? I have ever met some situation that vec_unpack >> can be beneficial when working on SELECT_VL but I don't which >> case > > With fixed size vectors you'll face the situation that the vectorizer > chooses the "wrong" vector type so yes, I think implementing > vec_unpack[s]_{lo,hi} might be useful. But I wouldn't prioritize this > until you have a more clear picture of how useful it would be. Another thing that comes to mind is that we currently don't do vectorizable calls with mismatched vector sizes. So even if we detected e.g. vec_widen_plus early it wouldn't get us much further. On the other hand, I don't think we perform many optimizations on such patterns between vect and combine (where we finally generate those). Regards Robin
Re: Question about wrapv-vect-reduc-dot-s8b.c
> the dump-scans. Can we do sth like > "vect_recog_dot_prod_pattern: detected\n(!FAILED)*SUCCEEDED", thus > after the dot-prod pattern dumping allow arbitrary stuff but _not_ > a "failed" and then require a "succeeded"? It took some fighting with tcl syntax until I arrived at the regex pattern below but it looks like it is possible. "vect_recog_dot_prod_pattern: detected(?:(?!failed).)*succeeded". This seems to work for the failing cases and I'm going to send a patch tomorrow if an x86 testsuite run is unchanged. Regards Robin
Re: gcc relies on RISC-V vcompress instruction undefined behaviour
> Hi, > > I think gcc is relying on undefined behaviour with the vcompress instruction. > Unfortunately my test case isn't reproducing on mainline, but gcc looks to > use the fields between the last mask selected field and vl while setting > tail agnostic. > > This thread explains how vcompress is different in that the tail starts after > the last mask selected field: > > https://github.com/riscvarchive/riscv-v-spec/issues/796 > > There was a bug in QEMU that I just fixed that prevented the all 1s tail > agnostic option (rvv_ta_all_1s) from poisoning these bits: > > https://lists.nongnu.org/archive/html/qemu-riscv/2024-10/msg00561.html > > With that fix, I see problems until I modify the previous setvli from ta to > tu. I think 9aabf81f40f0 ("RISC-V: Optimize permutation codegen with > compress") is one place we need to set tail undisturbed, but my fail is > from an earlier checkout so there must be another issue in the tree. > > I presume the fix is to force all these vcompress cases to set tail > undisturbed. > > Thanks, > Anton Could you send your test case so we can have a look? Or directly file a bug as Richi suggested. -- Regards Robin
Re: [RISC-V] vector segment load/store width as a riscv_tune_param
I am revisiting an effort to make the number of lanes for vector segment load/store a tunable parameter. A year ago, Robin added minimal and not-yet-tunable common_vector_cost::segment_permute_[2-8] But it is tunable, just not a param? :) We have our own cost structure in our downstream repo, adjusted to our uarch. I suggest you do the same or upstream a separate cost structure. I don't think anybody would object to having several of those, one for each uarch (as long as they are sufficiently distinct). BTW, just tangentially related and I don't know how sensitive your uarch is to scheduling, but with the x264 SAD and other sched issues we have seen you might consider disabling sched1 as well for your uarch? I know that for our uarch we want to keep it on but we surely could have another generic-like mtune option that disables it (maybe even generic-ooo and change the current generic-ooo to generic-in-order?). I would expect this to get more common in the future anyway. Some issues & questions: * Since this pertains only to segment load/store, why is the word "permute" in the name? The vectorizer already performs costing for the segment loads/stores (IIRC as simple loads, though). At some point the idea was to explicitly model the "segment permute/transpose" as a separate operation i.e. v0, v1, v2 = segmented_load3x3 (...) { load vtmp0; load vtmp1; load vtmp2; v0 = {vtmp0[0], v1tmp[0], v2tmp[0]}; v1 = {vtmp0[1], v1tmp[1], v2tmp[1]}; v2 = {vtmp0[2], v1tmp[2], v2tmp[2]}; } and that permute is the expensive part of the operation in 99% of the cases. That's where the wording comes from. * Nit: why are these defined as individual members rather than an array referenced as segment_permute[NF-2]? No real reason. I guess an array is preferable in several ways so feel free to change that. * I implemented tuning as a simple threshold for max NF where segment load/store is profitable. Test cases for vector segment store pass, but tests for load fail. I found that common_cost_vector::segment_permute is properly honored in the store case, but not even inspected in the load case. I will need to spelunk the autovec cost model. Clues are welcome. Could you give an example for that? Might just be a bug. Looking at gcc.target/riscv/rvv/autovec/struct/struct_vect-1.c, however I see that the cost is adjusted for loads, though. -- Regards Robin
Re: [RISC-V] vector segment load/store width as a riscv_tune_param
You won't see failures in the testsuite. The failures only show-up when I attempt to impose huge costs on NF above threshold. A quick & dirty way to expose the bug is apply the appended patch, then observe that you get output from this only for mask_struct_store-*.c and not for mask_struct_load-*.c I suppose that's due to Richi's restructuring of the vector/SLP code. What might work is (untested): diff --git a/gcc/config/riscv/riscv-vector-costs.cc b/gcc/config/riscv/riscv-vector-costs.cc index 167375ca751..3c8ff760a4e 100644 --- a/gcc/config/riscv/riscv-vector-costs.cc +++ b/gcc/config/riscv/riscv-vector-costs.cc @@ -1114,7 +1114,7 @@ segment_loadstore_group_size (enum vect_cost_for_stmt kind, unsigned costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop, stmt_vec_info stmt_info, -slp_tree, tree vectype, int stmt_cost) +slp_tree node, tree vectype, int stmt_cost) { const cpu_vector_cost *costs = get_vector_costs (); switch (kind) @@ -1138,6 +1138,8 @@ costs::adjust_stmt_cost (enum vect_cost_for_stmt kind, loop_vec_info loop, costs for each. */ /* TODO: Indexed and ordered/unordered cost. */ int group_size = segment_loadstore_group_size (kind, stmt_info); + if (!group_size && node->ldst_lanes) + group_size = node->lanes; if (group_size > 1) { switch (group_size) -- Regards Robin