On Mon, 22 May 2023 at 14:18, Richard Sandiford <richard.sandif...@arm.com> wrote: > > Prathamesh Kulkarni <prathamesh.kulka...@linaro.org> writes: > > Hi Richard, > > Thanks for the suggestions. Does the attached patch look OK ? > > Boostrap+test in progress on aarch64-linux-gnu. > > Like I say, please wait for the tests to complete before sending an RFA. > It saves a review cycle if the tests don't in fact pass. Right, sorry, will post patches after completion of testing henceforth. > > > diff --git a/gcc/config/aarch64/aarch64.cc b/gcc/config/aarch64/aarch64.cc > > index 29dbacfa917..e611a7cca25 100644 > > --- a/gcc/config/aarch64/aarch64.cc > > +++ b/gcc/config/aarch64/aarch64.cc > > @@ -22332,6 +22332,43 @@ aarch64_unzip_vector_init (machine_mode mode, rtx > > vals, bool even_p) > > return gen_rtx_PARALLEL (new_mode, vec); > > } > > > > +/* Return true if INSN is a scalar move. */ > > + > > +static bool > > +scalar_move_insn_p (const rtx_insn *insn) > > +{ > > + rtx set = single_set (insn); > > + if (!set) > > + return false; > > + rtx src = SET_SRC (set); > > + rtx dest = SET_DEST (set); > > + return is_a<scalar_mode>(GET_MODE (dest)) > > + && aarch64_mov_operand_p (src, GET_MODE (src)); > > Formatting: > > return (is_a<scalar_mode>(GET_MODE (dest)) > && aarch64_mov_operand_p (src, GET_MODE (src))); > > OK with that change if the tests pass, thanks. Unfortunately, the patch regressed vec-init-21.c:
int8x16_t f_s8(int8_t x, int8_t y) { return (int8x16_t) { x, y, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14 }; } -O3 code-gen trunk: f_s8: adrp x2, .LC0 ldr q0, [x2, #:lo12:.LC0] ins v0.b[0], w0 ins v0.b[1], w1 ret -O3 code-gen patch: f_s8: adrp x2, .LC0 ldr d31, [x2, #:lo12:.LC0] adrp x2, .LC1 ldr d0, [x2, #:lo12:.LC1] ins v31.b[0], w0 ins v0.b[0], w1 zip1 v0.16b, v31.16b, v0.16b ret With trunk, it chooses the fallback sequence because both fallback and zip1 sequence had cost = 20, however with patch applied, we end up with zip1 sequence cost = 24 and fallback sequence cost = 28. This happens because of using insn_cost instead of set_rtx_cost for the following expression: (set (reg:QI 100) (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)) set_rtx_cost returns 0 for above expression but insn_cost returns 4. This expression template appears twice in fallback sequence, which raises the cost to 28 from 20, while it appears once in each half of zip1 sequence, which raises the cost to 24 from 20, and so it now prefers zip1 sequence instead. I assumed this expression would be ignored because it looks like a scalar move, but that doesn't seem to be the case ? aarch64_classify_symbolic_expression returns SYMBOL_FORCE_TO_MEM for (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0) and thus aarch64_mov_operand_p returns false. Another issue with the zip1 sequence above is using same register x2 for loading another half of constant in: adrp x2, .LC1 I guess this will create an output dependency from adrp x2, .LC0 -> adrp x2, .LC1 and anti-dependency from ldr d31, [x2, #:lo12:.LC0] -> adrp x2, .LC1 essentially forcing almost the entire sequence (except ins instructions) to execute sequentially ? Fallback sequence rtl, cost = 28 (set (reg:V16QI 96) (const_vector:V16QI [ (const_int 7 [0x7]) (const_int 8 [0x8]) (const_int 1 [0x1]) (const_int 2 [0x2]) (const_int 3 [0x3]) (const_int 4 [0x4]) (const_int 5 [0x5]) (const_int 6 [0x6]) (const_int 7 [0x7]) (const_int 8 [0x8]) (const_int 9 [0x9]) (const_int 10 [0xa]) (const_int 11 [0xb]) (const_int 12 [0xc]) (const_int 13 [0xd]) (const_int 14 [0xe]) ])) cost = 12 (set (reg:QI 101) (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0)) cost = 4 (set (reg:V16QI 96) (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 101)) (reg:V16QI 96) (const_int 1 [0x1]))) cost = 4 (set (reg:QI 102) (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)) cost = 4 (set (reg:V16QI 96) (vec_merge:V16QI (vec_duplicate:V16QI (reg:QI 102)) (reg:V16QI 96) (const_int 2 [0x2]))) cost = 4 zip1 sequence rtl, cost = 24 (set (reg:V8QI 97) (const_vector:V8QI [ (const_int 7 [0x7]) (const_int 1 [0x1]) (const_int 3 [0x3]) (const_int 5 [0x5]) (const_int 7 [0x7]) (const_int 9 [0x9]) (const_int 11 [0xb]) (const_int 13 [0xd]) ])) cost = 12 (set (reg:QI 98) (subreg/s/u:QI (reg/v:SI 93 [ x ]) 0)) cost = 4 (set (reg:V8QI 97) (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 98)) (reg:V8QI 97) (const_int 1 [0x1]))) cost = 4 (set (reg:V8QI 99) (const_vector:V8QI [ (const_int 8 [0x8]) (const_int 2 [0x2]) (const_int 4 [0x4]) (const_int 6 [0x6]) (const_int 8 [0x8]) (const_int 10 [0xa]) (const_int 12 [0xc]) (const_int 14 [0xe]) ])) cost = 12 (set (reg:QI 100) (subreg/s/u:QI (reg/v:SI 94 [ y ]) 0)) cost = 4 (set (reg:V8QI 99) (vec_merge:V8QI (vec_duplicate:V8QI (reg:QI 100)) (reg:V8QI 99) (const_int 1 [0x1]))) cost = 4 (set (reg:V16QI 96) (unspec:V16QI [ (subreg:V16QI (reg:V8QI 97) 0) (subreg:V16QI (reg:V8QI 99) 0) ] UNSPEC_ZIP1)) cost = 4 Thanks, Prathamesh > > Richard