https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79934
Bug ID: 79934
Summary: Vectorization of descending-index loops can produce unnecessary permutes
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: wschmidt at gcc dot gnu.org
CC: anton at samba dot org, rguenth at gcc dot gnu.org
Target Milestone: ---

Consider:

long *a, *c;
int b;

void d ()
{
  for (; b; b--)
    a[b] |= c[b];
}

On POWER9, the core vectorized loop looks like this:

.L8:
        lxvx 0,6,9            // load a[b] and a[b-1]
        lxvx 12,8,9           // load c[b] and c[b-1]
        xxpermdi 0,0,0,2      // swap elements of a
        xxpermdi 12,12,12,2   // swap elements of c
        xxlor 0,0,12          // produce a | c
        xxpermdi 0,0,0,2      // swap elements of result
        stxvx 0,6,9           // store a[b] and a[b-1]
        addi 9,9,-16
        bdnz .L8

The xxpermdi instructions reverse the order of the vector elements in a
register.  However, the only operation performed on these values is
"lane-insensitive": each lane of the result depends only on the
corresponding lanes of the inputs.  Since the swaps occur on all inputs
and outputs of the computation, and none of the values escape the
iteration, performing the operation in the "wrong" lanes does not affect
the result (a standalone demonstration appears at the end of this
report).  Better code would be:

.L8:
        lxvx 0,6,9            // load a[b] and a[b-1]
        lxvx 12,8,9           // load c[b] and c[b-1]
        xxlor 0,0,12          // produce a | c in "wrong" lanes
        stxvx 0,6,9           // store a[b] and a[b-1]
        addi 9,9,-16
        bdnz .L8

The element reversals are introduced by the vectorizer:

  <bb 10> [55.08%]:
  # b.7_18 = PHI <_10(11), b.7_78(9)>
  # vectp.21_110 = PHI <vectp.21_111(11), vectp.22_104(9)>
  # vectp.25_119 = PHI <vectp.25_120(11), vectp.26_114(9)>
  # vectp.30_129 = PHI <vectp.30_130(11), vectp.31_124(9)>
  # ivtmp_133 = PHI <ivtmp_134(11), 0(9)>
  _2 = (long unsigned int) b.7_18;
  _3 = _2 * 8;
  _4 = a.0_1 + _3;
  vect__5.23_112 = MEM[(long int *)vectp.21_110];
  vect__5.24_113 = VEC_PERM_EXPR <vect__5.23_112, vect__5.23_112, { 1, 0 }>;
  _5 = *_4;
  _7 = c.2_6 + _3;
  vect__8.27_121 = MEM[(long int *)vectp.25_119];
  vect__8.28_122 = VEC_PERM_EXPR <vect__8.27_121, vect__8.27_121, { 1, 0 }>;
  _8 = *_7;
  vect__9.29_123 = vect__5.24_113 | vect__8.28_122;
  _9 = _5 | _8;
  vect__9.32_131 = VEC_PERM_EXPR <vect__9.29_123, vect__9.29_123, { 1, 0 }>;
  MEM[(long int *)vectp.30_129] = vect__9.32_131;
  _10 = b.7_18 + -1;
  vectp.21_111 = vectp.21_110 + 18446744073709551600;
  vectp.25_120 = vectp.25_119 + 18446744073709551600;
  vectp.30_130 = vectp.30_129 + 18446744073709551600;
  ivtmp_134 = ivtmp_133 + 1;
  if (ivtmp_134 < bnd.18_99)
    goto <bb 11>; [83.34%]
  else
    goto <bb 12>; [16.66%]

The unnecessary VEC_PERM_EXPRs (element reversals) could be removed after
vectorization by scanning for closed computations in each iteration that
have these properties (a schematic check appears at the end of this
report):

- All loads feed into an element reversal and nothing else;
- All stores are fed by an element reversal and nothing else;
- All intermediate computations are lane-insensitive;
- The computation has no inputs other than the element-reversed loads; and
- The computation has no outputs other than the element-reversed stores.

This is quite similar to the requirements of the analyze_swaps() pass in
the POWER back end (rs6000.c), which is somewhat more general in that
closed computations aren't limited to a single loop iteration but can
contain general control flow.
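To make the lane-insensitivity argument concrete, here is a minimal
standalone C sketch (illustration only; reverse2() is an invented
stand-in for the xxpermdi doubleword swap, not an intrinsic).  It checks
that reversing both inputs, OR-ing, and reversing the result is
lane-for-lane identical to OR-ing directly, which is why all three
xxpermdi instructions can be dropped:

#include <assert.h>
#include <stdio.h>

/* Illustrative stand-in for xxpermdi with a doubleword-swap selector:
   reverse the two 64-bit lanes of X into V.  */
static void reverse2 (long v[2], const long x[2])
{
  v[0] = x[1];
  v[1] = x[0];
}

int main (void)
{
  long a[2] = { 0x0123456789abcdefL, 0x0f0f0f0f0f0f0f0fL };
  long c[2] = { 0x1111111111111111L, 0x2222222222222222L };
  long ra[2], rc[2], t[2], swapped[2];

  /* What the current vectorized loop computes: swap both inputs,
     OR, swap the result back.  */
  reverse2 (ra, a);
  reverse2 (rc, c);
  for (int i = 0; i < 2; i++)
    t[i] = ra[i] | rc[i];
  reverse2 (swapped, t);

  /* What the improved loop computes: OR directly.  Because | is
     lane-insensitive, the two agree lane for lane.  */
  for (int i = 0; i < 2; i++)
    assert (swapped[i] == (a[i] | c[i]));

  puts ("swaps cancel: OK");
  return 0;
}

The same cancellation applies to any chain of lane-insensitive
operations, not just a single OR.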
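And here is a schematic, self-contained C sketch of the checklist above,
over a toy statement representation.  Every name in it (enum kind,
struct stmt, used_only_by, swaps_removable_p) is invented for
illustration; a real implementation would walk the GIMPLE statements
shown earlier, would classify each op code as lane-insensitive or not,
and would also need the closure conditions (no other inputs or outputs,
nothing escaping the iteration), which are assumed rather than checked
here:

#include <stdbool.h>
#include <stdio.h>

/* Toy IR, for illustration only.  LOAD/STORE model the vector memory
   accesses, REV models VEC_PERM_EXPR <x, x, { 1, 0 }>, and ELEMWISE
   models any lane-insensitive operation such as |.  */
enum kind { LOAD, STORE, REV, ELEMWISE };

struct stmt
{
  enum kind kind;
  int def;         /* id of the value defined; -1 for stores */
  int use1, use2;  /* ids of the values used; -1 if absent   */
};

/* True if value DEF is used only by statements of kind K.  */
static bool used_only_by (const struct stmt *s, int n, int def, enum kind k)
{
  for (int i = 0; i < n; i++)
    if ((s[i].use1 == def || s[i].use2 == def) && s[i].kind != k)
      return false;
  return true;
}

/* Check the reversal-related properties from the list above: every
   load feeds only reversals, and every store is fed by a reversal.
   If the body also satisfies the closure conditions (assumed here),
   each REV can be deleted and its uses rewritten to its input.  */
static bool swaps_removable_p (const struct stmt *s, int n)
{
  for (int i = 0; i < n; i++)
    {
      if (s[i].kind == LOAD && !used_only_by (s, n, s[i].def, REV))
        return false;
      if (s[i].kind == STORE)
        {
          bool fed_by_rev = false;
          for (int j = 0; j < n; j++)
            if (s[j].def == s[i].use1 && s[j].kind == REV)
              fed_by_rev = true;
          if (!fed_by_rev)
            return false;
        }
    }
  return true;
}

int main (void)
{
  /* Model of the vectorized loop body above:
     v0 = load a[]; v1 = load c[]; v2 = rev v0; v3 = rev v1;
     v4 = v2 | v3;  v5 = rev v4;   store v5.  */
  struct stmt body[] = {
    { LOAD,     0, -1, -1 },
    { LOAD,     1, -1, -1 },
    { REV,      2,  0, -1 },
    { REV,      3,  1, -1 },
    { ELEMWISE, 4,  2,  3 },
    { REV,      5,  4, -1 },
    { STORE,   -1,  5, -1 },
  };
  printf ("swaps removable: %s\n",
          swaps_removable_p (body, 7) ? "yes" : "no");
  return 0;
}

As noted above, analyze_swaps() in rs6000.c performs essentially this
analysis over computations that may span general control flow; after
vectorization the loop body here is straight-line, so a per-iteration
scan along these lines would suffice.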