https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119156
            Bug ID: 119156
           Summary: Placement of PTRUE instructions prevents PTEST elimination
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
                CC: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*-*-*

Tamar pointed out that for:

#include <arm_sve.h>

void inner_loop_029(double *restrict input, int64_t *restrict scale,
                    double *restrict output, int64_t size)
{
    svbool_t p;
    int64_t i = 0;
    while (p = svwhilelt_b64(i, size), svptest_first(svptrue_b64(), p)) {
        svst1(p, output+i, svld1(p, input+i));
        i += svcntd();
    }
}

we generate:

inner_loop_029:
        whilelt p14.d, xzr, x3
        ptrue   p15.d, all
        mov     p7.b, p14.b
        ptest   p15, p14.b
        b.nfrst .L1
        mov     x1, 0
        cntd    x4
.L3:
        ld1d    z31.d, p7/z, [x0, x1, lsl 3]
        st1d    z31.d, p7, [x2, x1, lsl 3]
        add     x1, x1, x4
        whilelt p7.d, x1, x3
        b.first .L3
.L1:
        ret

where we successfully eliminate the PTEST for the second WHILELT but not for
the first.
This is caused by an unfortunate instruction order in the input to
cc_fusion:

(insn 16 15 17 2 (parallel [
            (set (reg:VNx2BI 115)
                (unspec:VNx2BI [
                        (const_int 0 [0]) repeated x2
                        (reg:DI 129 [ size ])
                    ] UNSPEC_WHILELT))
            (clobber (reg:CC_NZC 66 cc))
        ]) "/app/example.c":7:16 10022 {while_ltdivnx2bi}
     (nil))
(insn 17 16 18 2 (set (reg/v:VNx16BI 104 [ p ])
        (subreg:VNx16BI (reg:VNx2BI 115) 0)) "/app/example.c":7:16 5714 {*aarch64_sve_movvnx16bi}
     (nil))
(debug_insn 18 17 19 2 (var_location:VNx16BI p (subreg:VNx16BI (reg:VNx2BI 115) 0)) "/app/example.c":7:16 -1
     (nil))
(insn 19 18 21 2 (set (reg:VNx16BI 116)
        (const_vector:VNx16BI repeat [
                (const_int 1 [0x1])
                (const_int 0 [0]) repeated x7
            ])) "/app/example.c":7:40 discrim 1 5714 {*aarch64_sve_movvnx16bi}
     (nil))
(note 21 19 22 2 NOTE_INSN_DELETED)
(note 22 21 24 2 NOTE_INSN_DELETED)
(note 24 22 25 2 NOTE_INSN_DELETED)
(insn 25 24 26 2 (set (reg:CC_NZC 66 cc)
        (unspec:CC_NZC [
                (reg:VNx16BI 116)
                (subreg:VNx2BI (reg:VNx16BI 116) 0)
                (const_int 1 [0x1])
                (reg:VNx2BI 115)
            ] UNSPEC_PTEST)) "/app/example.c":7:12 discrim 2 10304 {aarch64_ptestvnx2bi}
     (expr_list:REG_DEAD (reg:VNx2BI 115)
        (nil)))

The fused instruction that we want to generate for insn 25 would set
register 115, and so need to be before insn 17.  But it also uses the
ptrue generated by insn 19.  It therefore isn't possible to place the
fused instruction without reordering something else.

One thing that I'd been thinking of trying for a while is to emit unique
ptrues at the beginning of the function, so that they can be reused
throughout.  We already do something similar for Advanced SIMD zeros.

Adding:

    if (<MODE>mode == VNx16BImode
        && aarch64_ptrue_all_mode (operands[1]).exists ())
      {
        rtx tmp = gen_reg_rtx (<MODE>mode);
        emit_insn_before (gen_rtx_SET (tmp, operands[1]),
                          function_beg_insn);
        emit_move_insn (operands[0], tmp);
        DONE;
      }

to the predicate move patterns fixes it, but the real fix would be to use
aarch64_get_shareable_reg.
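
For illustration only, the placement constraint described above can be written as a small standalone C model.  This is not GCC code: struct toy_insn, can_place_fused and check_bug_scenario are hypothetical names, and the model assumes that the fused WHILELT+PTEST must sit after the definition of the ptrue register (116) and before the first use of the register it sets (115), with no other insn being moved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for an RTL insn (not a GCC type): the pseudo register it
   defines and the one it uses, with 0 meaning "none".  */
struct toy_insn {
  int uid;
  int def;
  int use;
};

/* Return true if an insn that defines FUSED_DEF and uses FUSED_USE can be
   placed somewhere in INSNS without moving any other insn, i.e. if
   FUSED_USE is defined before the first use of FUSED_DEF.  */
static bool
can_place_fused (const struct toy_insn *insns, size_t n,
                 int fused_def, int fused_use)
{
  size_t def_pos = n;    /* index of the insn that defines FUSED_USE */
  size_t first_use = n;  /* index of the first insn that uses FUSED_DEF */
  for (size_t i = 0; i < n; i++)
    {
      if (def_pos == n && insns[i].def == fused_use)
        def_pos = i;
      if (first_use == n && insns[i].use == fused_def)
        first_use = i;
    }
  return def_pos < first_use;
}

/* Insns 16 and 25 are the ones being fused, so only the insns that stay
   put are listed.  Order as in the dump above.  */
static const struct toy_insn as_reported[] = {
  { 17, 104, 115 },  /* p = subreg of reg 115 */
  { 19, 116, 0 },    /* ptrue into reg 116 */
};

/* Same insns with the ptrue emitted first (assumed effect of hoisting).  */
static const struct toy_insn ptrue_hoisted[] = {
  { 19, 116, 0 },
  { 17, 104, 115 },
};

void
check_bug_scenario (void)
{
  /* No valid slot: reg 116 is defined after reg 115 is first used.  */
  assert (!can_place_fused (as_reported, 2, 115, 116));
  /* With the ptrue first, a slot exists before insn 17.  */
  assert (can_place_fused (ptrue_hoisted, 2, 115, 116));
}
```

Calling check_bug_scenario () from a main function exercises both orderings.  The model ignores the CC clobber and the debug insn, so it only captures the ordering argument, not what cc_fusion actually checks.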