Hi, Richard.
Assertion failed at this IR:
_427 = _425 & _426;
_429 = present$0_16(D) != 0;
_430 = _425 & _429;
_409 = _430 | _445;
_410 = _409 | _449;
_411 = .LOOP_VECTORIZED (3, 6);
if (_411 != 0)
goto <bb 267>; [100.00%]
else
goto <bb 268>; [100.00%]
<bb 267> [local count: 3280550]:
<bb 83> [local count: 29823181]:
....
pretmp_56 = .MASK_LOAD (_293, 32B, _427); ---> cause assertion failed.
You can take a look this ifcvt IR and search 'pretmp_56 = .MASK_LOAD (_293,
32B, _427);'
RVV is totally the same IR as ARM SVE:
https://godbolt.org/z/rPbzfExWP
I was struggling at this issue for a few days but failed to figure out why.
I am sorry about that. Could you help me with that?
And I adjust the code as follows:
diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
index a9200767f67..42f85839c6e 100644
--- a/gcc/tree-vect-stmts.cc
+++ b/gcc/tree-vect-stmts.cc
@@ -9843,10 +9843,17 @@ vectorizable_load (vec_info *vinfo,
mask_index = internal_fn_mask_index (ifn);
if (mask_index >= 0 && slp_node)
mask_index = vect_slp_child_index_for_operand (call, mask_index);
+ slp_tree slp_op = NULL;
if (mask_index >= 0
&& !vect_check_scalar_mask (vinfo, stmt_info, slp_node, mask_index,
- &mask, NULL, &mask_dt, &mask_vectype))
+ &mask,
+ ifn == IFN_MASK_LEN_GATHER_LOAD ? &slp_op
+ : NULL,
+ &mask_dt, &mask_vectype))
return false;
+ if (mask_index >= 0 && slp_node
+ && !vect_maybe_update_slp_op_vectype (slp_op, mask_vectype))
+ gcc_unreachable ();
}
As you can see, except MASK_LEN_GATHER_LOAD, other LOADs, I pass 'NULL' same as
before.
Only MASK_LEN_GATHER_LOAD passes '&slp_op'. It works fine for RVV. But I don't
think it's a correct code, we may need to find another solution.
Thanks.
[email protected]
From: Richard Sandiford
Date: 2023-10-20 06:19
To: juzhe.zhong\@rivai.ai
CC: gcc-patches; rguenther
Subject: Re: [PATCH V5] VECT: Enhance SLP of MASK_LEN_GATHER_LOAD[PR111721]
"[email protected]" <[email protected]> writes:
> Hi, this patch fix V4 issue:
>
> Previously as Richard S commented:
> https://gcc.gnu.org/pipermail/gcc-patches/2023-October/633178.html
>
> slp_op and mask_vectype are only initialised when mask_index >= 0.
> Shouldn't this code be under mask_index >= 0 too?
> Also, when do we encounter mismatched mask_vectypes? Presumably the SLP
> node has a known vectype by this point. I think a comment would be useful.
>
> Since I didn't encounter mismatched case in the regression of RISC-V and X86,
> so
> I fix it in V4 patch as follows:
> + if (mask_index >= 0 && slp_node)
> + {
> + bool match_p
> + = vect_maybe_update_slp_op_vectype (slp_op, mask_vectype);
> + gcc_assert (match_p);
> + }
> Add assertion here.
>
> However, recently an ICE suddenly appear today in RISC-V regression:
>
> FAIL: gcc.dg/tree-ssa/pr44306.c (internal compiler error: in
> vectorizable_load, at tree-vect-stmts.cc:9885)
> FAIL: gcc.dg/tree-ssa/pr44306.c (test for excess errors)
>
> This is because we are encountering that mask_vectype is boolean type and it
> is external def.
> Then vect_maybe_update_slp_op_vectype will return false.
>
> Then I fix this piece of code in V5 here:
>
> + if (mask_index >= 0 && slp_node
> + && !vect_maybe_update_slp_op_vectype (slp_op, mask_vectype))
> + {
> + /* We don't vectorize the boolean type external SLP mask. */
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "incompatible vector types for invariants\n");
> + return false;
> + }
>
> Bootstrap and Regression on x86 passed.
Why are external defs a problem though? E.g. in pr44306, it looks like
we should be able to create an invariant mask that contains
(!present[0]) || UseDefaultScalingMatrix8x8Flag[0]
(!present[1]) || UseDefaultScalingMatrix8x8Flag[1]
(!present[0]) || UseDefaultScalingMatrix8x8Flag[0]
(!present[1]) || UseDefaultScalingMatrix8x8Flag[1]
...repeating...
The type of the mask can be driven by the code that needs it.
Thanks,
Richard
>
> Thanks.
>
>
> [email protected]
>
> From: Juzhe-Zhong
> Date: 2023-10-18 20:36
> To: gcc-patches
> CC: richard.sandiford; rguenther; Juzhe-Zhong
> Subject: [PATCH V5] VECT: Enhance SLP of MASK_LEN_GATHER_LOAD[PR111721]
> This patch fixes this following FAILs in RISC-V regression:
>
> FAIL: gcc.dg/vect/vect-gather-1.c -flto -ffat-lto-objects scan-tree-dump
> vect "Loop contains only SLP stmts"
> FAIL: gcc.dg/vect/vect-gather-1.c scan-tree-dump vect "Loop contains only SLP
> stmts"
> FAIL: gcc.dg/vect/vect-gather-3.c -flto -ffat-lto-objects scan-tree-dump
> vect "Loop contains only SLP stmts"
> FAIL: gcc.dg/vect/vect-gather-3.c scan-tree-dump vect "Loop contains only SLP
> stmts"
>
> The root cause of these FAIL is that GCC SLP failed on MASK_LEN_GATHER_LOAD.
>
> We have 2 following situations of scalar recognized MASK_LEN_GATHER_LOAD:
>
> 1. conditional gather load: MASK_LEN_GATHER_LOAD (base, offset, scale, zero,
> condtional mask).
>
> This situation we just need to leverage the current MASK_GATHER_LOAD which
> can achieve SLP MASK_LEN_GATHER_LOAD.
>
> 2. un-conditional gather load: MASK_LEN_GATHER_LOAD (base, offset, scale,
> zero, -1)
>
> Current SLP check will failed on dummy mask -1, so we relax the check in
> tree-vect-slp.cc and allow it to be materialized.
>
> Consider this following case:
>
> void __attribute__((noipa))
> f (int *restrict y, int *restrict x, int *restrict indices, int n)
> {
> for (int i = 0; i < n; ++i)
> {
> y[i * 2] = x[indices[i * 2]] + 1;
> y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
> }
> }
>
> https://godbolt.org/z/WG3M3n7Mo
>
> GCC unable to SLP using VEC_LOAD_LANES/VEC_STORE_LANES:
>
> f:
> ble a3,zero,.L5
> .L3:
> vsetvli a5,a3,e8,mf4,ta,ma
> vsetvli zero,a5,e32,m1,ta,ma
> vlseg2e32.v v6,(a2)
> vsetvli a4,zero,e64,m2,ta,ma
> vsext.vf2 v2,v6
> vsll.vi v2,v2,2
> vsetvli zero,a5,e32,m1,ta,ma
> vluxei64.v v1,(a1),v2
> vsetvli a4,zero,e64,m2,ta,ma
> vsext.vf2 v2,v7
> vsetvli zero,zero,e32,m1,ta,ma
> vadd.vi v4,v1,1
> vsetvli zero,zero,e64,m2,ta,ma
> vsll.vi v2,v2,2
> vsetvli zero,a5,e32,m1,ta,ma
> vluxei64.v v2,(a1),v2
> vsetvli a4,zero,e32,m1,ta,ma
> slli a6,a5,3
> vadd.vi v5,v2,2
> sub a3,a3,a5
> vsetvli zero,a5,e32,m1,ta,ma
> vsseg2e32.v v4,(a0)
> add a2,a2,a6
> add a0,a0,a6
> bne a3,zero,.L3
> .L5:
> ret
>
> After this patch:
>
> f:
> ble a3,zero,.L5
> li a5,1
> csrr t1,vlenb
> slli a5,a5,33
> srli a7,t1,2
> addi a5,a5,1
> slli a3,a3,1
> neg t3,a7
> vsetvli a4,zero,e64,m1,ta,ma
> vmv.v.x v4,a5
> .L3:
> minu a5,a3,a7
> vsetvli zero,a5,e32,m1,ta,ma
> vle32.v v1,0(a2)
> vsetvli a4,zero,e64,m2,ta,ma
> vsext.vf2 v2,v1
> vsll.vi v2,v2,2
> vsetvli zero,a5,e32,m1,ta,ma
> vluxei64.v v2,(a1),v2
> vsetvli a4,zero,e32,m1,ta,ma
> mv a6,a3
> vadd.vv v2,v2,v4
> vsetvli zero,a5,e32,m1,ta,ma
> vse32.v v2,0(a0)
> add a2,a2,t1
> add a0,a0,t1
> add a3,a3,t3
> bgtu a6,a7,.L3
> .L5:
> ret
>
> Note that I found we are missing conditional mask gather_load SLP test,
> Append a test for it in this patch.
>
> Tested on RISC-V and Bootstrap && Regression on X86 passed.
>
> Ok for trunk ?
>
> gcc/ChangeLog:
>
> * tree-vect-slp.cc (vect_get_operand_map): Add MASK_LEN_GATHER_LOAD.
> (vect_get_and_check_slp_defs): Ditto.
> (vect_build_slp_tree_1): Ditto.
> (vect_build_slp_tree_2): Ditto.
> * tree-vect-stmts.cc (vectorizable_load): Ditto.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.dg/vect/vect-gather-6.c: New test.
>
> ---
> gcc/testsuite/gcc.dg/vect/vect-gather-6.c | 15 +++++++++++++++
> gcc/tree-vect-slp.cc | 22 ++++++++++++++++++----
> gcc/tree-vect-stmts.cc | 12 +++++++++++-
> 3 files changed, 44 insertions(+), 5 deletions(-)
> create mode 100644 gcc/testsuite/gcc.dg/vect/vect-gather-6.c
>
> diff --git a/gcc/testsuite/gcc.dg/vect/vect-gather-6.c
> b/gcc/testsuite/gcc.dg/vect/vect-gather-6.c
> new file mode 100644
> index 00000000000..ff55f321854
> --- /dev/null
> +++ b/gcc/testsuite/gcc.dg/vect/vect-gather-6.c
> @@ -0,0 +1,15 @@
> +/* { dg-do compile } */
> +
> +void
> +f (int *restrict y, int *restrict x, int *restrict indices, int *restrict
> cond, int n)
> +{
> + for (int i = 0; i < n; ++i)
> + {
> + if (cond[i * 2])
> + y[i * 2] = x[indices[i * 2]] + 1;
> + if (cond[i * 2 + 1])
> + y[i * 2 + 1] = x[indices[i * 2 + 1]] + 2;
> + }
> +}
> +
> +/* { dg-final { scan-tree-dump "Loop contains only SLP stmts" vect { target
> vect_gather_load_ifn } } } */
> diff --git a/gcc/tree-vect-slp.cc b/gcc/tree-vect-slp.cc
> index d081999a763..146dba658a2 100644
> --- a/gcc/tree-vect-slp.cc
> +++ b/gcc/tree-vect-slp.cc
> @@ -552,6 +552,7 @@ vect_get_operand_map (const gimple *stmt, unsigned char
> swap = 0)
> return arg1_map;
> case IFN_MASK_GATHER_LOAD:
> + case IFN_MASK_LEN_GATHER_LOAD:
> return arg1_arg4_map;
> case IFN_MASK_STORE:
> @@ -719,8 +720,7 @@ vect_get_and_check_slp_defs (vec_info *vinfo, unsigned
> char swap,
> {
> tree type = TREE_TYPE (oprnd);
> dt = dts[i];
> - if ((dt == vect_constant_def
> - || dt == vect_external_def)
> + if (dt == vect_external_def
> && !GET_MODE_SIZE (vinfo->vector_mode).is_constant ()
> && (TREE_CODE (type) == BOOLEAN_TYPE
> || !can_duplicate_and_interleave_p (vinfo, stmts.length (),
> @@ -732,6 +732,17 @@ vect_get_and_check_slp_defs (vec_info *vinfo, unsigned
> char swap,
> "for variable-length SLP %T\n", oprnd);
> return -1;
> }
> + if (dt == vect_constant_def
> + && !GET_MODE_SIZE (vinfo->vector_mode).is_constant ()
> + && !can_duplicate_and_interleave_p (vinfo, stmts.length (), type))
> + {
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "Build SLP failed: invalid type of def "
> + "for variable-length SLP %T\n",
> + oprnd);
> + return -1;
> + }
> /* For the swapping logic below force vect_reduction_def
> for the reduction op in a SLP reduction group. */
> @@ -1096,7 +1107,8 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
> *swap,
> if (cfn == CFN_MASK_LOAD
> || cfn == CFN_GATHER_LOAD
> - || cfn == CFN_MASK_GATHER_LOAD)
> + || cfn == CFN_MASK_GATHER_LOAD
> + || cfn == CFN_MASK_LEN_GATHER_LOAD)
> ldst_p = true;
> else if (cfn == CFN_MASK_STORE)
> {
> @@ -1357,6 +1369,7 @@ vect_build_slp_tree_1 (vec_info *vinfo, unsigned char
> *swap,
> if (DR_IS_READ (STMT_VINFO_DATA_REF (stmt_info))
> && rhs_code != CFN_GATHER_LOAD
> && rhs_code != CFN_MASK_GATHER_LOAD
> + && rhs_code != CFN_MASK_LEN_GATHER_LOAD
> /* Not grouped loads are handled as externals for BB
> vectorization. For loop vectorization we can handle
> splats the same we handle single element interleaving. */
> @@ -1857,7 +1870,8 @@ vect_build_slp_tree_2 (vec_info *vinfo, slp_tree node,
> if (gcall *stmt = dyn_cast <gcall *> (stmt_info->stmt))
> gcc_assert (gimple_call_internal_p (stmt, IFN_MASK_LOAD)
> || gimple_call_internal_p (stmt, IFN_GATHER_LOAD)
> - || gimple_call_internal_p (stmt, IFN_MASK_GATHER_LOAD));
> + || gimple_call_internal_p (stmt, IFN_MASK_GATHER_LOAD)
> + || gimple_call_internal_p (stmt, IFN_MASK_LEN_GATHER_LOAD));
> else
> {
> *max_nunits = this_max_nunits;
> diff --git a/gcc/tree-vect-stmts.cc b/gcc/tree-vect-stmts.cc
> index e5ff44c25f1..0bc7beb1342 100644
> --- a/gcc/tree-vect-stmts.cc
> +++ b/gcc/tree-vect-stmts.cc
> @@ -9878,12 +9878,22 @@ vectorizable_load (vec_info *vinfo,
> return false;
> mask_index = internal_fn_mask_index (ifn);
> + slp_tree slp_op = NULL;
> if (mask_index >= 0 && slp_node)
> mask_index = vect_slp_child_index_for_operand (call, mask_index);
> if (mask_index >= 0
> && !vect_check_scalar_mask (vinfo, stmt_info, slp_node, mask_index,
> - &mask, NULL, &mask_dt, &mask_vectype))
> + &mask, &slp_op, &mask_dt, &mask_vectype))
> return false;
> + if (mask_index >= 0 && slp_node
> + && !vect_maybe_update_slp_op_vectype (slp_op, mask_vectype))
> + {
> + /* We don't vectorize the boolean type external SLP mask. */
> + if (dump_enabled_p ())
> + dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
> + "incompatible vector types for invariants\n");
> + return false;
> + }
> }
> tree vectype = STMT_VINFO_VECTYPE (stmt_info);