On Sun, Dec 1, 2024 at 4:51 PM Dusan Stojkovic
<dusan.stojko...@rt-rk.com> wrote:
>
> This patch introduces partial vectorization support for VEC_PERM_EXPR
> when vector types specified are not supported on some architectures
> because of their size.
>
> Take, for instance, this vector type:
> typedef int32_t vnx32si __attribute__ ((vector_size (128)));
>
> For -march=rv64gcv_zvl256b GCC doesn't vectorize the following code:
> __attribute__ ((noipa)) void permute_vnx32si (vnx32si values1,
> vnx32si values2,
>        vnx32si *out) {
>    vnx32si v =
>         __builtin_shufflevector (values1, values2, 1, 1, 1, 1, 1, 1, 1, 1,
>                                                    1, 1, 1, 1, 1, 1, 1, 1,
>                                                    1, 1, 1, 1, 1, 1, 1, 1,
>                                                    1, 1, 1, 1, 1, 1, 1, 1);
>     *(vnx32si *) out = v;
> }
>
> GCC produces 1xlw and 32xsw instructions, reading values1[1]
> and storing it using a scalar instruction.
>
> With this patch applied the resulting assembly would look like:
> permute_vnx32si:
>         vsetivli        zero,8,e32,m1,ta,ma
>         vle32.v v2,0(a0)
>         addi    sp,sp,-128
>         addi    a3,a2,32
>         addi    a4,a2,64
>         addi    a5,a2,96
>         vrgather.vi     v1,v2,1
>         vse32.v v1,0(a2)
>         vse32.v v1,0(a3)
>         vse32.v v1,0(a4)
>         vse32.v v1,0(a5)
>         addi    sp,sp,128
>         jr      ra
>
> With this patch vectorization of VEC_PERM_EXPR is possible for indexes
> which are bounded by split_type. split_type is the biggest vector type
> which is supported based on can_vec_perm_var_p. In this case it is
> 8 elements * 32 element size = 256 bits.
>
> The optimization works by iterating through the vector and sorting
> element values in two groups: values which are in the
> [0, split_elements) range they can be vectorized by reading from the
> first address of vector and ones which are greater than split_elements.
> Since there are two vectors to choose from for lowering the valid values
> for the second vector are [elements, split_elements).
> The values which are not in their respectable range are handled
> using scalar instructions.
>
> Values out of range are set to 0 in the original vector and are written
> to twice: first when the vector store instruction writes the value 0
> initially, second when the scalar store corrects the error.
>
> There is a condition, however, when this approach produces poor code.
> The worst case is when all indexes are out of range. This would produce
> something like:
>         vsetivli        zero,8,e32,m1,ta,ma # config
>         vle32.v v2,0(a0)  # One load for start of vector
>         addi    a3,a2,off # off = i*split_elements,
>                           # i   = [0,elements/split_elemets - 1)
>                           # totaling div - 1  addi for vse32
>         vrgather.vi       # v1,v2,1 # singular
>         vse32.v v1,0(a[i])  # store for each
>                             # i = [0,elements/split_elemets - 1)
>                             # totaling div  vse32
>
>     + a scalar load and multiple stores for each element.
>
> Counting up the vector code inserted together: 2 * div + 1
> if all insns cost 1. This is the reasoning behind arbitrary constraint:
>     s_assignments.length () / 2 > elements - (2 * div + 1)
> For if there are a greater number of scalar assignments the code would
> produce redundant vector instructions.
>
> Tested on risc-v 64 and risc-v 32, no regressions.
>
> PS. In the vector instruction example there are two add instructions
> working on the stack pointer register. I'm not quite sure about the
> purpose of these instructions.

Note this has to wait for next stage1 for integration.

I think splitting the permute should not look at uses (stores), that's
not the job
of lowering.  Instead currently lowering would change

 _1 = VEC_PERM <_1, _2, { .. }>;

to

 _11 = BIT_FIELD_REF <_1, ... lower part>;
 _12 = BIT_FIELD_REF <_1, ... upper part>;
... same for _2 ...
 _31 = VEC_PERM <part1, part2, { ... }>
 _32 = VEC_PERM <part1', part2', { ... }>
 _1 = { _31, _32 };

that is, vector lowering elsewhere tries to see whether it can use a smaller
vector type that it can extract parts from incoming vectors and compose the
outgoing vector from.

Dealing with stores from such composed vectors is not done by lowering
(it doesn't touch loads or stores), but the composition is elided by forwprop.
lowering can deal with avoiding intermediate compose/extract between stmts.

So - the current

  if (TREE_CODE (mask) == VECTOR_CST
      && tree_to_vec_perm_builder (&sel_int, mask))
    {
...
     }

check would need to be factored out and we'd probably try other compute
mode based on the permutation mask (we can't handle cross "sub-vector"
permutes in the above way).  An alternative "sub-vector" construction
might be to use even/odd extracts, but mind all other lowering would always
do the above scheme, so it wouldn't inter-operate well.

Richard.


> 2024-31-11  Dusan Stojkovic  <dusan.stojko...@rt-rk.com>
>
> PR target/116425
>
> gcc/ChangeLog:
>
> * tree-vect-generic.cc (split_lower_vec_perm): New function.
> (lower_vec_perm): New if condition.
>
> gcc/testsuite/ChangeLog:
>
> * gcc.target/riscv/rvv/autovec/pr116425-run.c: New test.
> * gcc.target/riscv/rvv/autovec/pr116425.c: New test.
>
>
> CONFIDENTIALITY: The contents of this e-mail are confidential and intended 
> only for the above addressee(s). If you are not the intended recipient, or 
> the person responsible for delivering it to the intended recipient, copying 
> or delivering it to anyone else or using it in any unauthorized manner is 
> prohibited and may be unlawful. If you receive this e-mail by mistake, please 
> notify the sender and the systems administrator at straym...@rt-rk.com 
> immediately.

Reply via email to