Re: [aarch64] Code-gen for vector initialization involving constants

Prathamesh Kulkarni via Gcc-patches Wed, 24 May 2023 02:30:30 -0700

On Mon, 22 May 2023 at 14:18, Richard Sandiford
<richard.sandif...@arm.com> wrote:
>
> Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes:
> > Hi Richard,
> > Thanks for the suggestions. Does the attached patch look OK ?
> > Boostrap+test in progress on aarch64-linux-gnu.
>
> Like I say, please wait for the tests to complete before sending an RFA.
> It saves a review cycle if the tests don't in fact pass.
Right, sorry, will post patches after completion of testing henceforth.
>
> > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc
> > index 29dbacfa917..e611a7cca25 100644
> > --- a/gcc/config/aarch64/aarch64.cc
> > +++ b/gcc/config/aarch64/aarch64.cc
> > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx 
> > vals, bool even_p)
> >    return gen_rtx_PARALLEL (new_mode, vec);
> >  }
> >
> > +/* Return true if INSN is a scalar move.  */
> > +
> > +static bool
> > +scalar_move_insn_p (const rtx_insn *insn)
> > +{
> > +  rtx set = single_set (insn);
> > +  if (!set)
> > +    return false;
> > +  rtx src = SET_SRC (set);
> > +  rtx dest = SET_DEST (set);
> > +  return is_a<scalar_mode>(GET_MODE (dest))
> > +      && aarch64_mov_operand_p (src, GET_MODE (src));
>
> Formatting:
>
>   return (is_a<scalar_mode>(GET_MODE (dest))
>           && aarch64_mov_operand_p (src, GET_MODE (src)));
>
> OK with that change if the tests pass, thanks.
Unfortunately, the patch regressed vec-init-21.c:


int8x16_t f_s8(int8_t x, int8_t y)
{
  return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6,
                       7, 8, 9, 10, 11, 12, 13, 14 };
}

-O3 code-gen trunk:
f_s8:
        adrp    x2, .LC0
        ldr     q0, [x2, #:lo12:.LC0]
        ins     v0.b[0], w0
        ins     v0.b[1], w1
        ret

-O3 code-gen patch:
f_s8:
        adrp    x2, .LC0
        ldr     d31, [x2, #:lo12:.LC0]
        adrp    x2, .LC1
        ldr     d0, [x2, #:lo12:.LC1]
        ins     v31.b[0], w0
        ins     v0.b[0], w1
        zip1    v0.16b, v31.16b, v0.16b
        ret

With trunk, it chooses the fallback sequence because both fallback
and zip1 sequence had cost = 20, however with patch applied,
we end up with zip1 sequence cost = 24 and fallback sequence
cost = 28.

This happens because of using insn_cost instead of
set_rtx_cost for the following expression:
(set (reg:QI 100)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
set_rtx_cost returns 0 for above expression but insn_cost returns 4.

This expression template appears twice in fallback sequence, which raises
the cost to 28 from 20, while it appears once in each half of zip1 sequence,
which raises the cost to 24 from 20, and so it now prefers zip1 sequence
instead.

I assumed this expression would be ignored because it looks like a scalar move,
but that doesn't seem to be the case ?
aarch64_classify_symbolic_expression returns
SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)
and thus aarch64_mov_operand_p returns false.

Another issue with the zip1 sequence above is using same register x2
for loading another half of constant in:
adrp    x2, .LC1

I guess this will create an output dependency from adrp x2, .LC0 ->
adrp x2, .LC1
and anti-dependency from  ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1
essentially forcing almost the entire sequence (except ins
instructions) to execute sequentially ?

Fallback sequence rtl, cost = 28
(set (reg:V16QI 96)
    (const_vector:V16QI [
            (const_int 7 [0x7])
            (const_int 8 [0x8])
            (const_int 1 [0x1])
            (const_int 2 [0x2])
            (const_int 3 [0x3])
            (const_int 4 [0x4])
            (const_int 5 [0x5])
            (const_int 6 [0x6])
            (const_int 7 [0x7])
            (const_int 8 [0x8])
            (const_int 9 [0x9])
            (const_int 10 [0xa])
            (const_int 11 [0xb])
            (const_int 12 [0xc])
            (const_int 13 [0xd])
            (const_int 14 [0xe])
        ]))
cost = 12
(set (reg:QI 101)
    (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0))
cost = 4
(set (reg:V16QI 96)
    (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 101))
        (reg:V16QI 96)
        (const_int 1 [0x1])))
cost = 4
(set (reg:QI 102)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
cost = 4
(set (reg:V16QI 96)
    (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 102))
        (reg:V16QI 96)
        (const_int 2 [0x2])))
cost = 4

zip1 sequence rtl, cost = 24
(set (reg:V8QI 97)
    (const_vector:V8QI [
            (const_int 7 [0x7])
            (const_int 1 [0x1])
            (const_int 3 [0x3])
            (const_int 5 [0x5])
            (const_int 7 [0x7])
            (const_int 9 [0x9])
            (const_int 11 [0xb])
            (const_int 13 [0xd])
        ]))
cost = 12
(set (reg:QI 98)
    (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0))
cost = 4
(set (reg:V8QI 97)
    (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 98))
        (reg:V8QI 97)
        (const_int 1 [0x1])))
cost = 4
(set (reg:V8QI 99)
    (const_vector:V8QI [
            (const_int 8 [0x8])
            (const_int 2 [0x2])
            (const_int 4 [0x4])
            (const_int 6 [0x6])
            (const_int 8 [0x8])
            (const_int 10 [0xa])
            (const_int 12 [0xc])
            (const_int 14 [0xe])
        ]))
cost = 12
(set (reg:QI 100)
    (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0))
cost = 4
(set (reg:V8QI 99)
    (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 100))
        (reg:V8QI 99)
        (const_int 1 [0x1])))
cost = 4
(set (reg:V16QI 96)
    (unspec:V16QI [
            (subreg:V16QI (reg:V8QI 97) 0)
            (subreg:V16QI (reg:V8QI 99) 0)
        ] UNSPEC_ZIP1))
cost = 4

Thanks,
Prathamesh
>
> Richard

Re: [aarch64] Code-gen for vector initialization involving constants

Reply via email to