On 6/9/23 04:41, juzhe.zh...@rivai.ai wrote:
From: Juzhe-Zhong <juzhe.zh...@rivai.ai>

This patch reworks Phase 5 && Phase 6 of the VSETVL PASS, since Phase 5 &&
Phase 6 are quite messy and cause some bugs discovered by my downstream
auto-vectorization test-generator.

Before this patch:

Phase 5 (cleanup_insns) is the function that removes the AVL operand
dependency from each RVV instruction.
E.g. vadd.vv (use a5) becomes, after Phase 5, vadd.vv (use const_int 0).
Since "a5" is used in the "vsetvl" instructions, once the correct "vsetvl"
instructions have been inserted, each RVV instruction no longer needs the
AVL operand "a5". Removing this operand dependency helps the following
scheduling pass.
Right. Removal of the unused operand gives the scheduler more freedom. It's not clear yet how much gain there is for scheduling vector on RV, but there's no good reason to handcuff it with unnecessary dependencies.



Phase 6 (propagate_avl) does the following 2 things:
1. Local && Global user vsetvl instructions optimization.
    E.g.
       vsetvli a2, a2, e8, mf8   ======> Change it into vsetvli a2, a2, e32, mf2
       vsetvli zero,a2, e32, mf2  ======> eliminate
Always good to eliminate more instructions. So while vsetvl is designed to be minimal overhead and it's fully expected that we'll see a lot of them, there's no good reason to have unnecessary ones in the stream.


2. Optimize user vsetvl from "vsetvl a2,a2" into "vsetvl zero,a2" if "a2" is
not used by any instructions.
Since Phases 1 ~ 4 insert "vsetvli" instructions based on LCM, which changes
the CFG, I re-initialize a new RTL_SSA framework (which is more expensive
than just using DF) for Phase 6 and optimize user vsetvli based on the new
RTL_SSA.
This one isn't as clear cut, but I still think it's the right thing to do. The first form explicitly kills the value in a2 while the second does not. Though if the value is dead it's going to be discoverable by DF and we should also end up with a REG_DEAD note as well. The second form does have the advantage that it does not open a new live range.


There are 2 issues in Phase 5 && Phase 6:
1. local_eliminate_vsetvl_insn was introduced by @kito and can do local user
vsetvl optimizations better than Phase 6 does; such an approach doesn't need
to re-initialize the RTL_SSA framework. So the local user vsetvli instruction
optimization in Phase 6 is redundant and should be removed.
2. A bug discovered by my downstream auto-vectorization test-generator (I
can't put the test in this patch since we are missing the autovec patterns
for it, so we can't reproduce the issue directly with upstream GCC, but I
will remember to put it back after I support the necessary autovec
patterns). This bug is caused by re-initializing the RTL_SSA framework.
The issue description is this:
Note that you could potentially go ahead and submit that test and just xfail it. Not a requirement, but a possibility that I sometimes use if I know I've got a fix coming shortly.


Before Phase 6:
    ...
    insn1: vsetvli a3, 17 <========== generated by SELECT_VL auto-vec pattern.
    slli a4,a3,3
    ...
    insn2: vsetvli zero, a3, ...
    load (use const_int 0; before Phase 5 it used a3, but the use of "a3"
was removed in Phase 5)
    ...

In Phase 6, we iterate to insn2, then get the def of "a3", which is insn1.
insn2 is the vsetvli instruction inserted in Phase 4, which is not included
in the RTL_SSA framework even though we renew it (I didn't take a look at
why, and I don't think we need to now).
In this situation, the def_info of "a3" (from insn1) reports
"set->single_nondebug_insn_use ()" as true. Obviously, this information is
incorrect, since insn1 has at least 2 uses:
1) slli a4,a3,3; 2) insn2: vsetvli zero, a3, ...
As a result, the execution test generated by my downstream test-generator
failed.
Understood.
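To make the failure mode concrete, here is a small Python model (purely illustrative, not GCC or RTL_SSA code) of the def-use bookkeeping involved: when a newly inserted instruction is never registered as a user of the def, a "single use" query returns a stale, wrong answer.

```python
# Illustrative model of RTL_SSA-style def-use tracking (not GCC code).
# If an instruction (insn2, inserted by Phase 4 after the framework was
# built) is never registered, the def of "a3" appears to have one use.

class Def:
    def __init__(self, name):
        self.name = name
        self.uses = []

    def add_use(self, insn):
        self.uses.append(insn)

    def single_nondebug_insn_use(self):
        # Returns the sole use if there is exactly one, else None.
        return self.uses[0] if len(self.uses) == 1 else None

a3 = Def("a3")
a3.add_use("slli a4,a3,3")         # registered when RTL_SSA was built
# insn2 ("vsetvli zero,a3,...") was inserted by Phase 4 (LCM) and is
# NOT registered in the framework, so the query is answered from
# incomplete information:
stale_single_use = a3.single_nondebug_insn_use()

a3.add_use("vsetvli zero,a3,...")  # what correct bookkeeping would record
correct_single_use = a3.single_nondebug_insn_use()
```

With the second use missing, `stale_single_use` is the slli instruction (a "single use"); once both uses are recorded, the query correctly returns no single use.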


Conclusion on the RTL_SSA framework:
Before this patch, we initialized RTL_SSA twice. One initialization is at
the beginning of the VSETVL PASS, which is absolutely correct; the other,
the re-initialization after Phase 4 (LCM), has incorrect information that
causes bugs.

Besides, we don't like initializing RTL_SSA a second time; it seems
wasteful since we just need to do a little optimization.

Based on all the circumstances described above, I rework and reorganize
Phase 5 && Phase 6 as follows:
1. Phase 5 is called ssa_post_optimization, which does optimization based
on the RTL_SSA information (the RTL_SSA is initialized at the beginning of
the VSETVL PASS; no need to re-initialize it). This phase includes 3
optimizations:
    1). local_eliminate_vsetvl_insn, which we already have (no change).
    2). global_eliminate_vsetvl_insn ---> a new optimization split from the
original Phase 6, but with a more powerful and reliable implementation.
       E.g.
       void f(int8_t *base, int8_t *out, size_t vl, size_t m, size_t k) {
         size_t avl;
         if (m > 100)
           avl = __riscv_vsetvl_e16mf4(vl << 4);
         else
           avl = __riscv_vsetvl_e32mf2(vl >> 8);
         for (size_t i = 0; i < m; i++) {
           vint8mf8_t v0 = __riscv_vle8_v_i8mf8(base + i, avl);
           v0 = __riscv_vadd_vv_i8mf8 (v0, v0, avl);
           __riscv_vse8_v_i8mf8(out + i, v0, avl);
         }
       }

       This example failed the global user vsetvl optimization before this patch:
       f:
               li      a5,100
               bleu    a3,a5,.L2
               slli    a2,a2,4
               vsetvli a4,a2,e16,mf4,ta,mu
       .L3:
               li      a5,0
               vsetvli zero,a4,e8,mf8,ta,ma
       .L5:
               add     a6,a0,a5
               add     a2,a1,a5
               vle8.v  v1,0(a6)
               addi    a5,a5,1
               vadd.vv v1,v1,v1
               vse8.v  v1,0(a2)
               bgtu    a3,a5,.L5
       .L10:
               ret
       .L2:
               beq     a3,zero,.L10
               srli    a2,a2,8
               vsetvli a4,a2,e32,mf2,ta,mu
               j       .L3
       With this patch:
       f:
               li      a5,100
               bleu    a3,a5,.L2
               slli    a2,a2,4
               vsetvli zero,a2,e8,mf8,ta,ma
       .L3:
               li      a5,0
       .L5:
               add     a6,a0,a5
               add     a2,a1,a5
               vle8.v  v1,0(a6)
               addi    a5,a5,1
               vadd.vv v1,v1,v1
               vse8.v  v1,0(a2)
               bgtu    a3,a5,.L5
       .L10:
               ret
       .L2:
               beq     a3,zero,.L10
               srli    a2,a2,8
               vsetvli zero,a2,e8,mf8,ta,ma
               j       .L3

    3). Remove the AVL operand dependency of each RVV instruction.

2. Phase 6 is called df_post_optimization: optimize "vsetvl a3,a2,..." into
"vsetvl zero,a2,..." based on dataflow analysis of the new CFG (the new CFG
is created by LCM). The reason we need to use the new CFG and run after
Phase 5:
    ...
    vsetvl a3, a2...
    vadd.vv (use a3)
    If we don't have Phase 5, which removes the "a3" use in vadd.vv, we
will fail to optimize vsetvl a3,a2 into vsetvl zero,a2.
This patch passed all tests in rvv.exp with ONLY performance && codegen
improved (no performance decline and no bugs, including my downstream
tests).

gcc/ChangeLog:

         * config/riscv/riscv-vsetvl.cc (available_occurrence_p): Ehance user 
vsetvl optimization.
         (vector_insn_info::parse_insn): Add rtx_insn parse.
         (pass_vsetvl::local_eliminate_vsetvl_insn): Ehance user vsetvl 
optimization.
         (get_first_vsetvl): New function.
         (pass_vsetvl::global_eliminate_vsetvl_insn): Ditto.
         (pass_vsetvl::cleanup_insns): Remove it.
         (pass_vsetvl::ssa_post_optimization): New function.
         (has_no_uses): Ditto.
         (pass_vsetvl::propagate_avl): Remove it.
         (pass_vsetvl::df_post_optimization): New function.
         (pass_vsetvl::lazy_vsetvl): Rework Phase 5 && Phase 6.
         * config/riscv/riscv-vsetvl.h: Adapt declaration.

gcc/testsuite/ChangeLog:

         * gcc.target/riscv/rvv/vsetvl/vsetvl-16.c: Adapt test.
         * gcc.target/riscv/rvv/vsetvl/vsetvl-2.c: Ditto.
         * gcc.target/riscv/rvv/vsetvl/vsetvl-3.c: Ditto.
         * gcc.target/riscv/rvv/vsetvl/vsetvl-21.c: New test.
         * gcc.target/riscv/rvv/vsetvl/vsetvl-22.c: New test.
         * gcc.target/riscv/rvv/vsetvl/vsetvl-23.c: New test.
s/Ehance/Enhance/

I would probably have suggested this get broken down into smaller chunks. I think you've got multiple things going on in this patch. I realize there may be some interdependencies, but they can often be dealt with.



@@ -4277,27 +4285,187 @@ pass_vsetvl::local_eliminate_vsetvl_insn (const 
bb_info *bb) const
      }
  }
-/* Before VSETVL PASS, RVV instructions pattern is depending on AVL operand
-   implicitly. Since we will emit VSETVL instruction and make RVV instructions
-   depending on VL/VTYPE global status registers, we remove the such AVL 
operand
-   in the RVV instructions pattern here in order to remove AVL dependencies 
when
-   AVL operand is a register operand.
-
-   Before the VSETVL PASS:
-     li a5,32
-     ...
-     vadd.vv (..., a5)
-   After the VSETVL PASS:
-     li a5,32
-     vsetvli zero, a5, ...
-     ...
-     vadd.vv (..., const_int 0).  */
+/* Get the first vsetvl instructions of the block.  */
I'd adjust the comment a bit, perhaps something like this:

/* Return the first vsetvl instruction in CFG_BB or NULL if
   none exists or if a user RVV instruction is encountered
   prior to any vsetvl.  */

+static rtx_insn *
+get_first_vsetvl (basic_block cfg_bb)
I'd probably adjust the name as well. There's an important exception to returning the first vsetvl -- you stop the search if you encounter a user RVV instruction.
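A minimal sketch of the search semantics being described (a Python model with hypothetical instruction-kind tags, not the actual GCC implementation): the scan returns the first vsetvl, but bails out if a user RVV instruction appears before any vsetvl.

```python
# Model of the suggested semantics: return the first vsetvl in a block,
# or None if none exists or a user RVV instruction appears before any
# vsetvl. Instructions are tagged "vsetvl", "rvv", or "other".

def first_vsetvl_before_rvv(insns):
    for insn in insns:
        if insn["kind"] == "vsetvl":
            return insn
        if insn["kind"] == "rvv":
            # A user RVV instruction consumes the current VL/VTYPE
            # status, so any later vsetvl is not "first" for this block.
            return None
    return None
```

This is why a name like get_first_vsetvl is slightly misleading: the early exit on a user RVV instruction is part of the contract, not just an optimization.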



+bool
+pass_vsetvl::global_eliminate_vsetvl_insn (const bb_info *bb) const
+{
[ ... ]
+
+  /* No need to optimize if block doesn't have vsetvl instructions.  */
+  if (!dem.valid_or_dirty_p () || !vsetvl_rinsn || !dem.get_avl_source ()
+      || !dem.has_avl_reg ())
+    return false;
It is considered best practice to test the cheapest conditional first (within the constraints of correctness). So I probably would have checked !vsetvl_rinsn first. Resulting in

  if (!vsetvl_rinsn || !dem.valid_or_dirty_p ()
      || !dem.get_avl_source () || !dem.has_avl_reg ())

Or

  if (!vsetvl_rinsn
      || !dem.valid_or_dirty_p ()
      || !dem.get_avl_source ()
      || !dem.has_avl_reg ())


The formatting in this case is more a personal preference. So don't consider changing the formatting to be a requirement to move forward.


+
+  /* If all preds have VL/VTYPE status set by user vsetvls, and these
+     user vsetvls are all skip_avl_compatible_p with the vsetvl in this
+     block, we can eliminate this vsetvl instruction.  */
+  sbitmap avin = m_vector_manager->vector_avin[cfg_bb->index];
+
+  unsigned int bb_index;
+  sbitmap_iterator sbi;
+  rtx avl = get_avl (dem.get_insn ()->rtl ());
+  hash_set<set_info *> sets
+    = get_all_sets (dem.get_avl_source (), true, false, false);
+  /* Condition 1: All VL/VTYPE available in are all compatible.  */
+  EXECUTE_IF_SET_IN_BITMAP (avin, 0, bb_index, sbi)
+    {
+      const auto &expr = m_vector_manager->vector_exprs[bb_index];
+      const auto &insn = expr->get_insn ();
+      def_info *def = find_access (insn->defs (), REGNO (avl));
+      set_info *set = safe_dyn_cast<set_info *> (def);
+      if (!vsetvl_insn_p (insn->rtl ()) || insn->bb () == bb
+         || !sets.contains (set))
+       return false;
+    }
+
+  /* Condition 2: Check it has preds.  */
+  if (EDGE_COUNT (cfg_bb->preds) == 0)
+    return false;
Not a big deal, but under what circumstances are we running into blocks with no predecessors? The only block that should have that property is the entry block. Similarly if you have no preds, then ISTM that avin will always be empty. So if we can validly have a block with no preds, then shouldn't this check go before walking AVIN just from a compile-time standpoint?




@@ -4342,135 +4510,81 @@ pass_vsetvl::cleanup_insns (void) const
      }
  }
+/* Return true if the SET result is not used by any instructions. */
+static bool
+has_no_uses (basic_block cfg_bb, rtx_insn *rinsn, int regno)
+{
+  /* Handle the following case that cannot be detected in RTL_SSA.  */
+  /* E.g.
+         li a5, 100
+         vsetvli a6, a5...
+         ...
+         vadd (use a6)
+
+       The use of "a6" is removed from "vadd" but the information is
+       not updated in the RTL_SSA framework. We don't want to
+       re-initialize RTL_SSA, which is expensive; instead, we use
+       data-flow analysis to check whether "a6" has no uses.  */
I'm a bit surprised there wasn't a reasonable way to update the RTL SSA framework for this case. If we were to remove the entire vadd, then we would have to update the uses of a6. If we have that capability, then I would expect we could refactor the updating code so that we had an API to remove an operand from an instruction.

In fact, if we have a constant propagator in the RTL SSA framework, wouldn't it have to have this capability?

I'm not objecting to what you've done at this time, but it seems like a better way might be possible. So the ask is to review the RTL SSA code to see if there's reasonable building blocks to do what you want.
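For reference, the DF-based check under discussion amounts to something like the following simplified Python model (assuming a single basic block with a known live-out set; the real implementation uses GCC's DF machinery, and the names here are illustrative only):

```python
# Simplified model of a "has no uses" dataflow check for one block.
# Each instruction records the register numbers it uses and defines.

def has_no_uses(insns_after_def, regno, live_out):
    # True if REGNO is neither used by any later instruction in the
    # block nor live out of the block.
    for insn in insns_after_def:
        if regno in insn["uses"]:
            return False
        if regno in insn["defs"]:
            # Redefined before any use: the earlier value is dead
            # from this point on.
            return True
    return regno not in live_out
```

The interesting part is the contrast with RTL_SSA: this scan works purely from the current RTL, so it stays correct even after Phase 5 has stripped AVL operands without updating the SSA use chains.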


Overall it looks pretty good. The biggest concern is the change to use DF use information rather than the RTL SSA framework. That may ultimately be a reasonable thing to do, but I'd like you to confirm that we don't have the right building blocks in the RTL SSA framework to do the incremental update you seem to need.

Thanks,
jeff
