https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79934

            Bug ID: 79934
           Summary: Vectorization of descending-index loops can produce
                    unnecessary permutes
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: wschmidt at gcc dot gnu.org
                CC: anton at samba dot org, rguenth at gcc dot gnu.org
  Target Milestone: ---

Consider

long *a, *c;
int b;
void d() {
  for (; b; b--)
    a[b] |= c[b];
}

On POWER9, the core vectorized loop looks like this:

.L8:
        lxvx 0,6,9           // load a[b] and a[b-1]
        lxvx 12,8,9          // load c[b] and c[b-1]
        xxpermdi 0,0,0,2     // swap elements of a
        xxpermdi 12,12,12,2  // swap elements of c
        xxlor 0,0,12         // produce a | c
        xxpermdi 0,0,0,2     // swap elements of result
        stxvx 0,6,9          // store a[b] and a[b-1]
        addi 9,9,-16
        bdnz .L8

The xxpermdi instructions reverse the order of the vector elements in a
register.  However, the only operation performed on these values (xxlor) is
"lane-insensitive."  Since the swaps occur on all inputs and outputs of the
computation, and none of the values escape the iteration, performing the
operation in the "wrong" lanes does not affect the result.  Better code would
be:

.L8:
        lxvx 0,6,9           // load a[b] and a[b-1]
        lxvx 12,8,9          // load c[b] and c[b-1]
        xxlor 0,0,12         // produce a | c in "wrong" lanes
        stxvx 0,6,9          // store a[b] and a[b-1]
        addi 9,9,-16
        bdnz .L8

This code is introduced by the vectorizer:

  <bb 10> [55.08%]:
  # b.7_18 = PHI <_10(11), b.7_78(9)>
  # vectp.21_110 = PHI <vectp.21_111(11), vectp.22_104(9)>
  # vectp.25_119 = PHI <vectp.25_120(11), vectp.26_114(9)>
  # vectp.30_129 = PHI <vectp.30_130(11), vectp.31_124(9)>
  # ivtmp_133 = PHI <ivtmp_134(11), 0(9)>
  _2 = (long unsigned int) b.7_18;
  _3 = _2 * 8;
  _4 = a.0_1 + _3;
  vect__5.23_112 = MEM[(long int *)vectp.21_110];
  vect__5.24_113 = VEC_PERM_EXPR <vect__5.23_112, vect__5.23_112, { 1, 0 }>;
  _5 = *_4;
  _7 = c.2_6 + _3;
  vect__8.27_121 = MEM[(long int *)vectp.25_119];
  vect__8.28_122 = VEC_PERM_EXPR <vect__8.27_121, vect__8.27_121, { 1, 0 }>;
  _8 = *_7;
  vect__9.29_123 = vect__5.24_113 | vect__8.28_122;
  _9 = _5 | _8;
  vect__9.32_131 = VEC_PERM_EXPR <vect__9.29_123, vect__9.29_123, { 1, 0 }>;
  MEM[(long int *)vectp.30_129] = vect__9.32_131;
  _10 = b.7_18 + -1;
  vectp.21_111 = vectp.21_110 + 18446744073709551600;
  vectp.25_120 = vectp.25_119 + 18446744073709551600;
  vectp.30_130 = vectp.30_129 + 18446744073709551600;
  ivtmp_134 = ivtmp_133 + 1;
  if (ivtmp_134 < bnd.18_99)
    goto <bb 11>; [83.34%]
  else
    goto <bb 12>; [16.66%]

The unnecessary VEC_PERM_EXPRs (element reversals) could be removed after
vectorization by scanning each loop iteration for closed computations with
these properties:

 - All loads feed into an element reversal and nothing else;
 - All stores are fed by an element reversal and nothing else;
 - All intermediate computations are lane-insensitive;
 - The computation has no other inputs than the element-reversed loads; and
 - The computation has no other outputs than the element-reversed stores.

This is quite similar to the requirements of the analyze_swaps() pass in the
POWER back end (rs6000.c), which is somewhat more general in that closed
computations aren't limited to a single loop iteration but can contain general
control flow.
