https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80724

            Bug ID: 80724
           Summary: gcc.target/aarch64/pr62178.c failed because of r247885
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: amker at gcc dot gnu.org
  Target Milestone: ---

After r247885, test gcc.target/aarch64/pr62178.c failed as below:
    gcc.target/aarch64/pr62178.c scan-assembler ld1r\\t{v[0-9]+.

First, the innermost loop after ivopts is:

  <bb 12> [26.32%]:
  # vectp_b.12_66 = PHI <vectp_b.12_67(13), vectp_b.12_64(11)>
  # vect__5.16_70 = PHI <vect__5.16_71(13), { 0, 0, 0, 0 }(11)>
  # ivtmp.56_96 = PHI <ivtmp.56_97(13), ivtmp.56_98(11)>
  _102 = (void *) ivtmp.56_96;
  _2 = MEM[base: _102, offset: 4B];
  vect_cst__62 = {_2, _2, _2, _2};
  vect__3.14_68 = MEM[base: vectp_b.12_66, offset: 0B];
  vect__4.15_69 = vect_cst__62 * vect__3.14_68;
  vect__5.16_71 = vect__4.15_69 + vect__5.16_70;
  vectp_b.12_67 = vectp_b.12_66 + 124;
  ivtmp.56_97 = ivtmp.56_96 + 4;
  _112 = (vector(4) int *) ivtmp.68_106;
  if (vectp_b.12_67 != _112)
    goto <bb 13>; [96.66%]
  else
    goto <bb 14>; [3.34%]

  <bb 13> [25.44%]:
  goto <bb 12>; [100.00%]


Note that candidate ivtmp.56_96 is offset by 4 bytes from the address of the
use, thus MEM[base: _102, offset: 4B] is generated rather than:
  _2 = MEM[base: _102, offset: 0B];
With a zero offset, the load could be combined with
vect_cst__62 = {_2, _2, _2, _2}; so that ld1r can be used.
IVOPTs has no knowledge that MEM[base + 4] leads to a different outcome than
MEM[base] in this case.
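
To make the difference concrete, here is a minimal Python sketch (not GCC
code) that models the two candidates as byte offsets from &a. The function
names and the (register value, immediate offset) representation are
hypothetical; only the constants 124 and 4 and the 30 iterations are taken
from the dumps in this report. Both candidates reach the same addresses, but
only the one whose base already includes the +4 does so with a zero
immediate offset, which, as far as I understand the AArch64 addressing
modes, is the form ld1r needs.

```python
# Hypothetical model of the two IV candidates; addresses are byte offsets
# from &a. Names are illustrative and do not exist in GCC.
ROW_BYTES = 124   # row stride seen in the dump: (sizetype) i_27 * 124
ELEM_BYTES = 4    # IV step


def cand13_accesses(i, niter):
    """Base includes the +4, so every load uses immediate offset 0."""
    iv = i * ROW_BYTES + 4
    accesses = []
    for _ in range(niter):
        accesses.append((iv, 0))   # (register value, immediate offset)
        iv += ELEM_BYTES
    return accesses


def cand14_accesses(i, niter):
    """Base lacks the +4, so every load needs immediate offset 4."""
    iv = i * ROW_BYTES
    accesses = []
    for _ in range(niter):
        accesses.append((iv, 4))
        iv += ELEM_BYTES
    return accesses


def effective(accesses):
    """Effective address of each access: register value plus offset."""
    return [reg + off for reg, off in accesses]


if __name__ == "__main__":
    a13 = cand13_accesses(1, 30)
    a14 = cand14_accesses(1, 30)
    print(effective(a13) == effective(a14))   # same memory is read
    print({off for _, off in a13}, {off for _, off in a14})
```

The two candidates are equivalent as far as the loaded values go; the only
difference the sketch exposes is the immediate offset left on the load.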

For this iv_use:
Group 0:
  Type: ADDRESS
  Use 0.0:
    At stmt:    _2 = a[i_27][k_29];
    At pos:     a[i_27][k_29]
    IV struct:
      Type:     int *
      Base:     (int *) (&a + ((sizetype) i_27 * 124 + 4))
      Step:     4
      Object:   (void *) &a
      Biv:      N
      Overflowness wrto loop niter:     Overflow
There are two candidates:
Candidate 13:
  Var befor: ivtmp.55
  Var after: ivtmp.55
  Incr POS: before exit test
  IV struct:
    Type:       unsigned long
    Base:       (unsigned long) (&a + ((sizetype) i_27 * 124 + 4))
    Step:       4
    Object:     (void *) &a
    Biv:        N
    Overflowness wrto loop niter:       Overflow
Applying pattern match.pd:1902, generic-match.c:9693
Candidate 14:
  Var befor: ivtmp.56
  Var after: ivtmp.56
  Incr POS: before exit test
  IV struct:
    Type:       unsigned long
    Base:       (unsigned long) (&a + (sizetype) i_27 * 124)
    Step:       4
    Object:     (void *) &a
    Biv:        N
    Overflowness wrto loop niter:       Overflow

The cost is as below:
<Candidate Costs>:
  cand  cost
  0     5
  1     5
  2     5
  3     5
  4     4
  5     5
  6     5
  7     5
  8     5
  9     5
  10    5
  11    5
  12    5
  13    6
  14    5
<Group-candidate Costs>:
Group 0:
  cand  cost    compl.  inv.expr.       inv.vars
  1     2       2       1;      NIL;
  2     2       2       2;      NIL;
  3     1       2       3;      NIL;
  13    0       0       NIL;    NIL;
  14    0       1       NIL;    NIL;

Note we choose cand_14 only because the cost of cand_13 itself is higher than
that of cand_14.
This is because the loop iterates 30 times, and we have:
cand_13
  base: (unsigned long) (&a + ((sizetype) i_27 * 124 + 4))
  cost: 33 (before amortize against loop niter) / 30 = 1
cand_14
  base: (unsigned long) (&a + (sizetype) i_27 * 124)
  cost: 29 (before amortize against loop niter) / 30 = 0

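Note that the choice is on the verge here: the loop niter (30) falls between
the two setup costs (29 and 33), so the amortized costs differ only because of
integer rounding.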
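
The amortization arithmetic can be reproduced with a short Python sketch.
It assumes the setup cost is divided by the iteration count with truncating
integer division, which is what the figures 33 / 30 = 1 and 29 / 30 = 0
suggest; the helper name is hypothetical and is not a GCC function.

```python
def amortized_setup_cost(setup_cost, niter):
    """Setup cost spread over niter iterations, truncated to an integer.

    Assumption: truncating division, inferred from the numbers in the
    report rather than taken from the GCC sources.
    """
    return setup_cost // niter


if __name__ == "__main__":
    candidates = (("cand_13", 33), ("cand_14", 29))
    # 30 is the iteration count from the report; 29 and 34 are hypothetical
    # neighbours chosen to show how fragile the tie-break is.
    for niter in (29, 30, 34):
        costs = {name: amortized_setup_cost(setup, niter)
                 for name, setup in candidates}
        print(niter, costs)
```

Under this assumption the two candidates differ only at iteration counts
from 30 to 33; at 29 both amortize to 1 and at 34 both amortize to 0. The
one-unit gap at 30 matches the difference between cand 13 (6) and cand 14
(5) in the candidate cost table.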

With this ivopts issue, the innermost loop should have only one more
instruction.  Unfortunately, before RTL combine, we have:
   74: r74:SI=[++r99:DI]
      REG_INC r99:DI
   75: r123:V4SI=[post r90:DI+=0x7c]
      REG_INC r90:DI
   77: r124:V4SI=vec_duplicate(r74:SI)
      REG_DEAD r74:SI
   78: r126:V4SI=r123:V4SI*r124:V4SI
      REG_DEAD r124:V4SI
      REG_DEAD r123:V4SI
   79: r93:V4SI=r93:V4SI+r126:V4SI
      REG_DEAD r126:V4SI
The combine pass tries to combine 77/78, rather than 78/79, like:
   74: r74:SI=[++r99:DI]
      REG_INC r99:DI
   75: r123:V4SI=[post r90:DI+=0x7c]
      REG_INC r90:DI
   77: NOTE_INSN_DELETED
   78: r126:V4SI=vec_duplicate(r74:SI)*r123:V4SI
      REG_DEAD r74:SI
      REG_DEAD r123:V4SI
   79: r93:V4SI=r93:V4SI+r126:V4SI
      REG_DEAD r126:V4SI

So it misses the mul+add combination, and instead combines into a pattern
which generates two instructions:
        fmov    s3, w0  // 157  *movsi_aarch64/12       [length = 4]
        mul     v0.4s, v0.4s, v3.s[0]   // 78   *aarch64_mul3_elt_from_dupv4si  [length = 4]
