https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63341
Jakub Jelinek <jakub at gcc dot gnu.org> changed: What |Removed |Added ---------------------------------------------------------------------------- Target Milestone|--- |4.8.4 Summary|[4.8/4.9/5 |[4.8/4.9/5 Regression] |RegressionVectorizer |Vectorization | |miscompilation with | |-mcpu=power7 --- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> --- The following testcase is miscompiled on powerpc64-linux with -O2 -ftree-vectorize -m32 -mcpu=power7 -mtune=power7: typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U; struct S { char e __attribute__((aligned (16))); U s[32]; }; struct S t = {0, {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8}, {9}, {10}, {11}, {12}, {13}, {14}, {15}, {16}, {17}, {18}, {19}, {20}, {21}, {22}, {23}, {24}, {25}, {26}, {27}, {28}, {29}, {30}, {31}, {32}}}; unsigned short d[32]; int main () { int i; for (i = 0; i < 32; i++) d[i] = t.s[i].s; if (__builtin_memcmp (d, t.s, sizeof d)) __builtin_abort (); return 0; } The problem as I see is that the loop is vectorized with dr_explicit_realign_optimized, and calls vect_create_data_ref_ptr with offset of TYPE_VECT_SUBPARTS (type) - 1, which is 7 in this case. This function then calls vect_create_addr_base_for_vector_ref which multiplies the offset by step (2 in this case), and thus adds 14 bytes to the base address. The base address is &t + 1 and &t is 16 bytes aligned, so it loads the first part from (&t + 1) & -16 (correct) and the second part from (&t + 1 + 14) & -16, which is still wrong, because (&t + 15) & -16 is still &t, not &t + 16 we want to use. The problem is that as the offset given to vect_create_data_ref_ptr is not in bytes, but in number of subparts, after multiplying it by step we end up with offset that doesn't have the lowest bit(s) set, which is required for this kind of realignment. Richard, thoughts on this? Either we'd need the offset always already in bytes and let all the callers ensure it is in bytes, or some flag whether offset is in bytes or units, or some flag to add not just offset * step, but offset * step + (step - 1), or do that always.