https://gcc.gnu.org/bugzilla/show_bug.cgi?id=63341

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |4.8.4
            Summary|[4.8/4.9/5 Regression]      |[4.8/4.9/5 Regression]
                   |Vectorizer                  |Vectorization
                   |                            |miscompilation with
                   |                            |-mcpu=power7

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
The following testcase is miscompiled on powerpc64-linux with -O2
-ftree-vectorize -m32 -mcpu=power7 -mtune=power7:
typedef union U { unsigned short s; unsigned char c; } __attribute__((packed)) U;
struct S { char e __attribute__((aligned (16))); U s[32]; };
struct S t = {0, {{1}, {2}, {3}, {4}, {5}, {6}, {7}, {8},
  {9}, {10}, {11}, {12}, {13}, {14}, {15}, {16},
  {17}, {18}, {19}, {20}, {21}, {22}, {23}, {24},
  {25}, {26}, {27}, {28}, {29}, {30}, {31}, {32}}};
unsigned short d[32];

int
main ()
{
  int i;
  for (i = 0; i < 32; i++)
    d[i] = t.s[i].s;
  if (__builtin_memcmp (d, t.s, sizeof d))
    __builtin_abort ();
  return 0;
}

The problem, as I see it, is that the loop is vectorized with
dr_explicit_realign_optimized, which calls vect_create_data_ref_ptr with an
offset of TYPE_VECTOR_SUBPARTS (type) - 1, i.e. 7 in this case.  That function
then calls vect_create_addr_base_for_vector_ref, which multiplies the offset
by the step (2 in this case), and thus adds 14 bytes to the base address.  The
base address is &t + 1 and &t is 16-byte aligned, so the first part is loaded
from (&t + 1) & -16 (correct) and the second part from (&t + 1 + 14) & -16,
which is wrong, because (&t + 15) & -16 is still &t, not the &t + 16 we want
to use.  The problem is that the offset given to vect_create_data_ref_ptr is
not in bytes but in number of subparts, so after multiplying it by the step we
end up with an offset that doesn't have the lowest bit(s) set, which is
required for this kind of realignment.  Richard, thoughts on this?  Either we'd
need the offset always in bytes and let all the callers ensure that, or some
flag saying whether the offset is in bytes or in units, or some flag to add
not just offset * step, but offset * step + (step - 1), or do that always.
