https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120383

            Bug ID: 120383
           Summary: Improving early break unrolled sequences with Adv.
                    SIMD
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: tnfchris at gcc dot gnu.org
            Blocks: 53947, 115130
  Target Milestone: ---
            Target: aarch64*

Today if we unroll an early break loop such as:

#define N 640
long long a[N] = {};
long long b[N] = {};

int f1 (long long c)
{
  for (int i = 0; i < N; i++)
    {
      if (a[i] == c)
        return 1;
    }
  return 0;
}

we generate an ORR reduction followed by a compression sequence when using Adv.
SIMD.

        ldp     q31, q30, [x1], 32
        cmeq    v31.2d, v31.2d, v27.2d
        cmeq    v30.2d, v30.2d, v27.2d
        orr     v31.16b, v31.16b, v30.16b
        umaxp   v31.4s, v31.4s, v31.4s
        fmov    x4, d31
        cbz     x4, .L2
        fmov    w1, s29

However, the dependency chain here is long enough that it removes the
profitability of the vector code.

This sequence can, however, be replaced by:

        ldp     q31, q30, [x0], 32
        cmeq    v31.2d, v31.2d, v29.2d
        cmeq    v30.2d, v30.2d, v29.2d
        addhn   v31.2s, v31.2d, v30.2d
        fmov    x2, d31
        cbz     x2, .L15

with addhn replacing the ORR reduction and the UMAXP compression.

When using 3 compare statements we can use nested addhn; when using 4 we can
create 2 ORRs + an ADDHN.
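
For illustration, here is a rough intrinsics-level sketch of the two-compare
replacement above and of the four-compare (2 ORRs + ADDHN) shape. This is an
assumption about the intended lowering rather than code GCC emits today, and
the helper names are made up:

#include <arm_neon.h>
#include <stdint.h>

/* Two compare masks folded with a single ADDHN; a nonzero result means
   some lane matched.  Only valid because the CMEQ results are all-zeros
   or all-ones per lane.  */
static inline uint64_t
any_lane_2 (uint64x2_t m0, uint64x2_t m1)
{
  uint32x2_t n = vaddhn_u64 (m0, m1);
  return vget_lane_u64 (vreinterpret_u64_u32 (n), 0);
}

/* Four compare masks folded with 2 ORRs + 1 ADDHN.  */
static inline uint64_t
any_lane_4 (uint64x2_t m0, uint64x2_t m1, uint64x2_t m2, uint64x2_t m3)
{
  uint64x2_t t0 = vorrq_u64 (m0, m1);
  uint64x2_t t1 = vorrq_u64 (m2, m3);
  return vget_lane_u64 (vreinterpret_u64_u32 (vaddhn_u64 (t0, t1)), 0);
}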

The AArch64 ADDHN (Add returning High Narrow) instruction adds two vectors
with the same element size and then narrows the result, keeping only the
upper half of each element of the sum.
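
Concretely, for 64-bit input lanes each output lane is the top 32 bits of the
64-bit sum. A scalar model of one lane, for exposition only:

#include <stdint.h>

/* Per-lane model of ADDHN on 64-bit inputs: the high half of the
   full-precision sum.  */
static inline uint32_t
addhn_lane (uint64_t a, uint64_t b)
{
  return (uint32_t) ((a + b) >> 32);
}

/* With mask inputs (0 or all-ones, as produced by CMEQ) the result is
   nonzero exactly when at least one input is set:
     addhn_lane ( 0,  0) == 0x00000000
     addhn_lane (~0,  0) == 0xffffffff   (sum 0xffffffffffffffff)
     addhn_lane (~0, ~0) == 0xffffffff   (sum 0xfffffffffffffffe)  */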

This means the transformation is hard to do in RTL, as combine would not be
able to match such a long sequence and the intermediate combinations aren't
valid on their own. For instance, it is only correct when the vector mask
values are 0 and -1, so we would need to know that the values in the
registers are vector masks.

An RTL folder won't work either as it won't let us get the nested variant.

Which leaves a gimple folder or support in the vectorizer for this.

By far the simplest option is the vectorizer, as it knows about mask types
(e.g. VECTOR_BOOLEAN_TYPE_P (..)), it knows the precision of the mask type,
and it is the one generating the sequence, so it can choose how to do the
reductions.

However, for this to work we have to introduce an optab for addhn. Open
coding the sequence doesn't work because we have no way to describe that the
addition is performed at a higher precision than the narrow result.

With the optab, the final codegen can generate a scalar cbranch rather than a
vector one and we get the result we want.

This makes unrolling an early break loop much more profitable on AArch64.

Having looked into the options and described the limitations above, are you
OK with a new optab for addhn, Richi?


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115130
[Bug 115130] [meta-bug] early break vectorization
