http://gcc.gnu.org/bugzilla/show_bug.cgi?id=52298

Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|ASSIGNED                    |NEW
         AssignedTo|jakub at gcc dot gnu.org    |unassigned at gcc dot
                   |                            |gnu.org

--- Comment #2 from Jakub Jelinek <jakub at gcc dot gnu.org> 2012-02-20 
08:57:40 UTC ---
Unassigning myself, not familiar enough with the nested_in_vect_loop stuff,
Richard, could you please have a look at this?
Just random comments:

Shorter testcase (-O3):
int a, b, i[2];

int
foo (void)
{
  int l = 0;

  for (a = 0; a <= 3; a++)
    for (b = 1; b >= 0; b--)
      l |= i[b];
  return l;
}

The immediate problem I see is that in vectorizable_load we initialize
  negative = tree_int_cst_compare (DR_STEP (dr), size_zero_node) < 0;
i.e. from the inner loop's step (which is really negative), while we actually
use STMT_VINFO_DR_STEP when deciding if it is invariant and how to adjust the
read in each loop.  STMT_VINFO_DR_STEP is 0 here, i.e. invariant, and
vectorizable_load in that case inserts stmts after the original stmt, but as
negative is true, we then put a permutation of it (which doesn't make any sense
for invariant loads, as all the vector elements are the same) before the
original stmt.  So, a quick hack of ignoring negative if inv_p && !bb_vinfo
fixes this, but can't be the right fix.  I guess negative flag should be taken
from STMT_VINFO_DR_STEP if nested_in_vect_loop, but not sure if always, and the
question is what to do with the testing for the negative case, which needs to
pass a dr around, and for the outer loop we don't have any dr with the right
step etc.

I think the testcase shows several other issues, which can be observed also on
(-O3):

int a, b, i[2];

int
foo (void)
{
  int l = 0;

  for (a = 0; a <= 3; a++)
    for (b = 0; b <= 1; b++)
      l |= i[b];
  return l;
}

1) it is unfortunately cunrolli pass doesn't do anything here, then the
vectorizer wouldn't make such weird decisions; but when cunrolli is run, lim
hasn't moved out the b variable loads/stores from the loop yet, and the next
complete unrolling happens only after vectorization.  Perhaps another inner
loop complete unrolling a couple of passes before vectorization would help
here, but the question is if it would help even real-world apps.

2) if the outer loop has zero step and inner non-zero step, the question is if
it is worthwhile to do the vectorization at all, and at least on the above
testcase the cost model should say that it can't be beneficial given the large
cost of the reduction for a single iteration.

3) after complete unrolling, we end up with:
vect_var_.18_27 = vect_cst_.17_14 | { 0, 0, 0, 0 };
but nothing at the tree level optimizes that away into just
vect_var_.18_27 = vect_cst_.17_14;

So, the resulting assembly:
        movd    i(%rip), %xmm0
        movl    $2, b(%rip)
        movd    i+4(%rip), %xmm2
        movl    $4, a(%rip)
        pshufd  $0, %xmm0, %xmm1
        pshufd  $0, %xmm2, %xmm0
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $8, %xmm1
        por     %xmm1, %xmm0
        movdqa  %xmm0, %xmm1
        psrldq  $4, %xmm1
        por     %xmm1, %xmm0
        movd    %xmm0, -12(%rsp)
        movl    -12(%rsp), %eax
        ret
really can't be faster than
        movl    i+4(%rip), %edx
        movl    i(%rip), %eax
        movl    $2, b(%rip)
        movl    $4, a(%rip)
        orl     %edx, %eax
        orl     %edx, %eax
        ret
(but even for the non-vectorized loop, we should optimize away the second orl).

While playing with this, I ended up with a wrong-code testcase as opposed to
ice-on-valid:

/* { dg-options "-O1 -ftree-vectorize -fno-tree-pre -fno-tree-loop-im" } */

extern void abort (void);
int c[80];

__attribute__((noinline)) int
foo (void)
{
  int l = 0;
  int a, b;

  for (a = 3; a >= 0; a--)
    for (b = 7; b >= 0; b--)
      l |= c[a+60];
  return l;
}

int
main ()
{
  int i;
  for (i = 0; i < 60; i++)
    c[i] = 1;
  for (; i < 64; i++)
    c[i] = 1 << (i - 59);
  if (foo () != 30)
    abort ();
  return 0;
}

And last minor thing (shouldn't we have a warning for it, at least static code
analyzers should and usually do warn about it):

vect_analyze_data_ref_access does:
  tree step = DR_STEP (dr);
...
  HOST_WIDE_INT dr_step = TREE_INT_CST_LOW (step);
...
  if (loop_vinfo && !step)
    {
      if (vect_print_dump_info (REPORT_DETAILS))
        fprintf (vect_dump, "bad data-ref access in loop");
      return false;
    }
So, either the if should go, because we'd segfault if step is NULL, or
initialization of dr_step needs to be moved after this test.

Reply via email to