https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117031
Bug ID: 117031
Summary: increasing VF during SLP vectorization permutes unnecessarily
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: tnfchris at gcc dot gnu.org
Target Milestone: ---

The following testcase:

---
void test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}
---

at -O3 vectorizes using LOAD_LANES on aarch64:

  vect_array.11 = .LOAD_LANES (MEM <short unsigned int[32]> [(short unsigned int *)vectp_x.9_123]);
  vect_a_29.12_125 = vect_array.11[0];
  vect__14.17_129 = [vec_unpack_lo_expr] vect_a_29.12_125;
  vect__14.17_130 = [vec_unpack_hi_expr] vect_a_29.12_125;
  vect__14.16_131 = [vec_unpack_lo_expr] vect__14.17_129;
  vect__14.16_132 = [vec_unpack_hi_expr] vect__14.17_129;
  vect__14.16_133 = [vec_unpack_lo_expr] vect__14.17_130;
  vect__14.16_134 = [vec_unpack_hi_expr] vect__14.17_130;
  vect__14.18_135 = (vector(2) double) vect__14.16_131;
  vect__14.18_136 = (vector(2) double) vect__14.16_132;
  vect__14.18_137 = (vector(2) double) vect__14.16_133;
  vect__14.18_138 = (vector(2) double) vect__14.16_134;
  ...

because input type is 4 shorts, so V4HI is the natural size. V4HI fails to
vectorize because we don't support direct conversion from V4HI to V4SI. We
then pick a higher VF (V8HI) and the loads are detected as interleaving.
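To make the difference between the two load shapes concrete, here is a minimal scalar sketch of the data movement done by a 4-way de-interleaving load compared with plain contiguous loads. It is only a model, not GCC's or the hardware's implementation: the names `load_lanes_model`, `load_linear_model`, `GROUP` and `LANES` are made up for illustration, and the 8-lane width assumes the V8HI case described above.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed shape: 4 fields (a, b, c, d) per scalar iteration and
   8 unsigned shorts per vector (V8HI), i.e. 32 elements per load.  */
enum { GROUP = 4, LANES = 8 };

/* Scalar model of a 4-way de-interleaving load:
   out[k][j] = x[j * GROUP + k].  out[0] then holds all the 'a' values,
   playing the role of vect_array.11[0] in the dump.  */
void
load_lanes_model (const unsigned short *x, unsigned short out[GROUP][LANES])
{
  assert (x != NULL);
  for (size_t j = 0; j < LANES; j++)
    for (size_t k = 0; k < GROUP; k++)
      out[k][j] = x[j * GROUP + k];
}

/* Scalar model of plain contiguous loads: four vectors, each holding
   8 consecutive elements, with no permutation.  */
void
load_linear_model (const unsigned short *x, unsigned short out[GROUP][LANES])
{
  assert (x != NULL);
  for (size_t v = 0; v < GROUP; v++)
    for (size_t j = 0; j < LANES; j++)
      out[v][j] = x[v * LANES + j];
}
```

In the interleaved model the four values of one scalar iteration sit in the same lane of four different vectors, whereas in the linear model they are adjacent within one vector.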

LLVM, however, avoids the permute here by detecting that the unrolling doesn't
result in a permuted access, as it's equivalent to:

void test3 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i += 2)
    {
      unsigned short a1 = x[i * 4 + 0];
      unsigned short b1 = x[i * 4 + 1];
      unsigned short c1 = x[i * 4 + 2];
      unsigned short d1 = x[i * 4 + 3];
      y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
      unsigned short a2 = x[i * 4 + 4];
      unsigned short b2 = x[i * 4 + 5];
      unsigned short c2 = x[i * 4 + 6];
      unsigned short d2 = x[i * 4 + 7];
      y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
    }
}

GCC seems to miss that there is no gap between the group accesses and that
stride == 1. test3 is vectorized linearly by GCC, so this looks like a missed
optimization in data ref analysis?
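The equivalence claimed above can be checked with a small sketch. The two loops are copied from this report; `unrolled_matches_original` and `accesses_are_contiguous` are hypothetical helpers written only for this check, and the `assert` on even `n` is an added assumption, since the unrolled form as written would run past the end for odd `n`.

```c
#include <assert.h>

/* test1 from the report.  */
void test1 (unsigned short *x, double *y, int n)
{
  for (int i = 0; i < n; i++)
    {
      unsigned short a = x[i * 4 + 0];
      unsigned short b = x[i * 4 + 1];
      unsigned short c = x[i * 4 + 2];
      unsigned short d = x[i * 4 + 3];
      y[i] = (double)a + (double)b + (double)c + (double)d;
    }
}

/* test3 from the report, with an added precondition that n is even.  */
void test3 (unsigned short *x, double *y, int n)
{
  assert (n % 2 == 0);
  for (int i = 0; i < n; i += 2)
    {
      unsigned short a1 = x[i * 4 + 0];
      unsigned short b1 = x[i * 4 + 1];
      unsigned short c1 = x[i * 4 + 2];
      unsigned short d1 = x[i * 4 + 3];
      y[i+0] = (double)a1 + (double)b1 + (double)c1 + (double)d1;
      unsigned short a2 = x[i * 4 + 4];
      unsigned short b2 = x[i * 4 + 5];
      unsigned short c2 = x[i * 4 + 6];
      unsigned short d2 = x[i * 4 + 7];
      y[i+1] = (double)a2 + (double)b2 + (double)c2 + (double)d2;
    }
}

enum { N = 8 };  /* arbitrary even trip count for the check */

/* Hypothetical helper: returns 1 if both loops produce the same y.  */
int unrolled_matches_original (void)
{
  unsigned short x[N * 4];
  double y1[N], y3[N];
  for (int i = 0; i < N * 4; i++)
    x[i] = (unsigned short) (i * 7 + 1);
  test1 (x, y1, N);
  test3 (x, y3, N);
  for (int i = 0; i < N; i++)
    if (y1[i] != y3[i])
      return 0;
  return 1;
}

/* Hypothetical helper: walks the load indices of test1 in program order
   and returns 1 if each index is exactly the previous one plus 1, i.e.
   the group of 4 has no gap and successive groups are adjacent.  */
int accesses_are_contiguous (int n)
{
  int expected = 0;
  for (int i = 0; i < n; i++)
    for (int k = 0; k < 4; k++)
      {
        if (i * 4 + k != expected)
          return 0;
        expected++;
      }
  return 1;
}
```

Both helpers return 1 for this input. The group size (4) equals the per-iteration step in elements (4), so across iterations test1 reads x as a single unit-stride stream.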