https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119156

            Bug ID: 119156
           Summary: Placement of PTRUE instructions prevents PTEST
                    elimination
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
                CC: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64*-*-*

Tamar pointed out that for:

#include <arm_sve.h>

void inner_loop_029(double *restrict input, int64_t *restrict scale,
                           double *restrict output, int64_t size) {
    svbool_t p;
    int64_t i = 0;
    while (p = svwhilelt_b64(i, size), svptest_first(svptrue_b64(), p)) {
        svst1(p, output+i, svld1(p, input+i));
        i += svcntd();
    }
}

we generate:

inner_loop_029:
        whilelt p14.d, xzr, x3
        ptrue   p15.d, all
        mov     p7.b, p14.b
        ptest   p15, p14.b
        b.nfrst .L1
        mov     x1, 0
        cntd    x4
.L3:
        ld1d    z31.d, p7/z, [x0, x1, lsl 3]
        st1d    z31.d, p7, [x2, x1, lsl 3]
        add     x1, x1, x4
        whilelt p7.d, x1, x3
        b.first .L3
.L1:
        ret

where we successfully eliminate the PTEST for the second WHILELT but not for
the first.  (WHILELT sets the NZC flags itself, so a following PTEST whose
governing predicate is all-true is redundant.)
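
For comparison, if the first PTEST were fused as well, the preamble could be as
simple as (hand-written sketch of the hoped-for code, not actual compiler
output):

inner_loop_029:
        whilelt p7.d, xzr, x3
        b.nfrst .L1
        mov     x1, 0
        cntd    x4
        ...

with the PTRUE, MOV and PTEST all gone.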

This is caused by an unfortunate instruction order in the input to cc_fusion:

(insn 16 15 17 2 (parallel [
            (set (reg:VNx2BI 115)
                (unspec:VNx2BI [
                        (const_int 0 [0]) repeated x2
                        (reg:DI 129 [ size ])
                    ] UNSPEC_WHILELT))
            (clobber (reg:CC_NZC 66 cc))
        ]) "/app/example.c":7:16 10022 {while_ltdivnx2bi}
     (nil))
(insn 17 16 18 2 (set (reg/v:VNx16BI 104 [ p ])
        (subreg:VNx16BI (reg:VNx2BI 115) 0)) "/app/example.c":7:16 5714
{*aarch64_sve_movvnx16bi}
     (nil))
(debug_insn 18 17 19 2 (var_location:VNx16BI p (subreg:VNx16BI (reg:VNx2BI 115)
0)) "/app/example.c":7:16 -1
     (nil))
(insn 19 18 21 2 (set (reg:VNx16BI 116)
        (const_vector:VNx16BI repeat [
                (const_int 1 [0x1])
                (const_int 0 [0]) repeated x7
            ])) "/app/example.c":7:40 discrim 1 5714 {*aarch64_sve_movvnx16bi}
     (nil))
(note 21 19 22 2 NOTE_INSN_DELETED)
(note 22 21 24 2 NOTE_INSN_DELETED)
(note 24 22 25 2 NOTE_INSN_DELETED)
(insn 25 24 26 2 (set (reg:CC_NZC 66 cc)
        (unspec:CC_NZC [
                (reg:VNx16BI 116)
                (subreg:VNx2BI (reg:VNx16BI 116) 0)
                (const_int 1 [0x1])
                (reg:VNx2BI 115)
            ] UNSPEC_PTEST)) "/app/example.c":7:12 discrim 2 10304
{aarch64_ptestvnx2bi}
     (expr_list:REG_DEAD (reg:VNx2BI 115)
        (nil)))

The fused instruction that we want to generate for insn 25 would set register
115, and so would need to be placed before insn 17.  But it also uses the ptrue
generated by insn 19.  It therefore isn't possible to place the fused
instruction without reordering something else.
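
Concretely, the fused instruction would look something like this (illustrative
RTL rather than an actual compiler dump), with the PTEST folded into the
WHILELT pattern:

(insn ... (parallel [
            (set (reg:CC_NZC 66 cc)
                (unspec:CC_NZC [
                        (reg:VNx16BI 116)
                        (subreg:VNx2BI (reg:VNx16BI 116) 0)
                        (const_int 1 [0x1])
                        (unspec:VNx2BI [
                                (const_int 0 [0]) repeated x2
                                (reg:DI 129 [ size ])
                            ] UNSPEC_WHILELT)
                    ] UNSPEC_PTEST))
            (set (reg:VNx2BI 115)
                (unspec:VNx2BI [
                        (const_int 0 [0]) repeated x2
                        (reg:DI 129 [ size ])
                    ] UNSPEC_WHILELT))
        ]))

It defines register 115 (used by insn 17) but also uses register 116 (defined
by insn 19), hence the placement problem.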

One thing that I'd been thinking of trying for a while is to emit each distinct
ptrue constant once, at the beginning of the function, so that it can be reused
throughout.  We already do something similar for Advanced SIMD zeros.

Adding:

    if (<MODE>mode == VNx16BImode
        && aarch64_ptrue_all_mode (operands[1]).exists ())
      {
        /* The source is an all-true predicate constant: materialize it
           once at the start of the function...  */
        rtx tmp = gen_reg_rtx (<MODE>mode);
        emit_insn_before (gen_rtx_SET (tmp, operands[1]),
                          function_beg_insn);
        /* ...and turn this move into a copy of that shared register.  */
        emit_move_insn (operands[0], tmp);
        DONE;
      }

to the predicate move patterns fixes it, but the real fix would be to use
aarch64_get_shareable_reg.
