https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68796

            Bug ID: 68796
           Summary: Make use of and-immediate+compare instructions more
                    robust
           Product: gcc
           Version: 6.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

The aarch64 target (and others) has an instruction that performs a bitwise AND
and compares the result for equality against zero: the TST instruction.
In config/aarch64/aarch64.md its pattern is represented as:
(define_insn "*and<mode>3nr_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (and:GPI (match_operand:GPI 0 "register_operand" "%r,r")
                  (match_operand:GPI 1 "aarch64_logical_operand" "r,<lconst>"))
         (const_int 0)))]
  ""
  "tst\\t%<w>0, %<w>1"
  [(set_attr "type" "logics_reg,logics_imm")]
)

We should be utilising it more than we are at the moment.
Some examples where we should be using it but are not:
int
f8 (int x)
{
  if (x & 8)
    return 1;
  return x;
}

For -O2 on an aarch64 toolchain we generate:
f8:
        and     w1, w0, 8
        cmp     w1, wzr
        csinc   w0, w0, wzr, eq
        ret

when we could be doing:
f8:
        tst     x0, 8
        csinc   w0, w0, wzr, eq
        ret

Combine tries to match some patterns like:
(set (reg:CC 66 cc)
    (compare:CC (zero_extract:DI (reg:DI 0 x0 [ x ])
            (const_int 1 [0x1])
            (const_int 3 [0x3]))
        (const_int 0 [0])))

and
(set (reg:CC 66 cc)
    (compare:CC (and:DI (lshiftrt:DI (reg:DI 0 x0 [ x ])
                (const_int 3 [0x3]))
            (const_int 1 [0x1]))
        (const_int 0 [0])))

but not the and-immediate case for which we have a pattern.
Now, the documentation in md.texi on the topic of canonicalization says this:
" Equality comparisons of a group of bits (usually a single bit) with zero
  will be written using @code{zero_extract} rather than the equivalent
  @code{and} or @code{sign_extract} operations."

So the aarch64 port is arguably at fault here for not providing the canonical
form.
Indeed, adding a pattern that matches something like:
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (zero_extract:GPI (match_operand:GPI 0 "register_operand" "r")
                  (match_operand:GPI 1 "const_int_operand" "n")
                  (match_operand:GPI 2 "const_int_operand" "n"))
         (const_int 0)))]

and converts operands 1 and 2 into the appropriate bitmask for the TST
instruction does work (together with an appropriate change to the aarch64
SELECT_CC_MODE implementation to allow the zero_extract case).
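
To make that concrete, a minimal sketch of such a pattern is below, with the
bitmask reconstructed inline in the output block. The name, the condition and
the attribute choice are illustrative only, not a tested patch, and it relies
on the SELECT_CC_MODE change above so that the zero_extract comparison is
given CC_NZmode:

;; Sketch: TST of a contiguous group of bits expressed as a zero_extract.
;; In (zero_extract x size pos), operand 1 is the number of bits and
;; operand 2 the starting bit, so the equivalent AND immediate is
;; ((1 << size) - 1) << pos.  A single contiguous run of bits that is not
;; the whole register should always be encodable as an AArch64 logical
;; immediate.
(define_insn "*and<mode>3nr_compare0_zextract"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (zero_extract:GPI (match_operand:GPI 0 "register_operand" "r")
                  (match_operand:GPI 1 "const_int_operand" "n")
                  (match_operand:GPI 2 "const_int_operand" "n"))
         (const_int 0)))]
  "INTVAL (operands[1]) > 0
   && INTVAL (operands[2]) >= 0
   && INTVAL (operands[1]) < GET_MODE_BITSIZE (<MODE>mode)
   && INTVAL (operands[1]) + INTVAL (operands[2])
      <= GET_MODE_BITSIZE (<MODE>mode)"
  {
    /* Rebuild the AND immediate from the width and position.  */
    unsigned HOST_WIDE_INT mask
      = ((((unsigned HOST_WIDE_INT) 1) << INTVAL (operands[1])) - 1)
        << INTVAL (operands[2]);
    operands[1] = GEN_INT (mask);
    return "tst\\t%<w>0, %1";
  }
  [(set_attr "type" "logics_imm")]
)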

However, consider the case where we have an immediate that corresponds to the
mode mask of a short mode like QImode:
int
f255 (int x)
{
  if (x & 255)
    return 1;
  return x;
}

Again, we generate:
f255:
        and     w1, w0, 255
        cmp     w1, wzr
        csinc   w0, w0, wzr, eq
        ret

instead of:
f255:
        tst     w0, 255
        csinc   w0, w0, wzr, eq
        ret

even with the zero_extract pattern in place.

Here, combine doesn't try to match either a zero_extract variant or an
AND-immediate form.
It creates a QImode subreg of the operand and tries to perform the comparison
in the shortest mode.
That is, it tries to match:
(set (reg:CC_ZESWP 66 cc)
    (compare:CC_ZESWP (reg:QI 0 x0 [ x ])
        (const_int 0 [0])))

For which we have no pattern.
Again, I could add a pattern to aarch64.md to match something like:
(define_insn "*and<mode>_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (match_operand:SHORT 0 "register_operand" "r")
         (const_int 0)))]

where SHORT is a mode iterator over QImode and HImode, generate a TST
instruction with the mode mask, and modify SELECT_CC_MODE appropriately.
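
A rough sketch of that second pattern, purely for illustration (the pattern
name and the <shortmask> mode attribute below are made up for this example,
and it again assumes SELECT_CC_MODE is taught to use CC_NZmode for this
comparison):

;; Hypothetical helper attribute giving the mode mask of each short mode.
(define_mode_attr shortmask [(QI "255") (HI "65535")])

;; Sketch: compare a QImode/HImode register against zero by testing the
;; containing W register against the mode mask, which is a valid AArch64
;; logical immediate (0xff or 0xffff).
(define_insn "*tst<mode>_compare0"
  [(set (reg:CC_NZ CC_REGNUM)
        (compare:CC_NZ
         (match_operand:SHORT 0 "register_operand" "r")
         (const_int 0)))]
  ""
  "tst\\t%w0, <shortmask>"
  [(set_attr "type" "logics_imm")]
)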

However, it seems to me that the general and-immediate+compare form of the
instruction should be matched more robustly
rather than having to specify all the different equivalent forms in the MD
files.

Making the above changes to the aarch64 backend generates 42% more tst
instructions across SPEC2006 and reduces the number of cmp instructions by
about 9%, and the generated code always looks simpler because we avoid an
AND-immediate operation followed by a compare or a compare+branch.

Is there any way we can make combine try harder to match these equivalent
forms?
Or should we go ahead and add these equivalent forms of the
and-immediate+compare pattern?
I think adding a pattern to handle the zero_extract case is not unreasonable,
since the canonicalization rules dictate it, but for the second case perhaps
we should not be trying short subregs for which there is no native
instruction.
