https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118471

            Bug ID: 118471
           Summary: Missed folding of descriptor span field for contiguous
                    Fortran pointers
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: enhancement
          Priority: P3
         Component: fortran
          Assignee: unassigned at gcc dot gnu.org
          Reporter: rsandifo at gcc dot gnu.org
  Target Milestone: ---

[I'm filing this speculatively because I don't know whether the optimisation is
valid.  Sorry in advance if it's not.]

For:

subroutine foo(a, n)
  real(kind=8), pointer, contiguous :: a(:)
  integer :: i, n

  do i = 1, n
    a(i) = 1.0
  end do
end subroutine foo

the contiguous attribute on the pointer means that we can assume a stride of
1 in:

  a.data + (i * a.stride + a.offset) * a.span

But we still treat the span as variable.  Are there any cases in which it
can be anything other than 8 (the size in bytes of a real(kind=8) element)?
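
[For background, the descriptor's span field exists because a pointer can be
associated with storage in which consecutive elements lie further apart than
the element size.  A minimal sketch of such a case, without the CONTIGUOUS
attribute; whether anything similar can conform with CONTIGUOUS is exactly
the open question here:]

subroutine span_demo
  type :: pair
    real(kind=8) :: x, y
  end type pair
  type(pair), target :: arr(4)
  real(kind=8), pointer :: p(:)

  ! Consecutive p(i) lie 16 bytes apart (the size of one pair), so the
  ! descriptor records span = 16 while the element size is 8.  This
  ! target is not contiguous, so adding CONTIGUOUS to p would make the
  ! pointer assignment below invalid.
  p => arr%x
  p = 0.0
end subroutine span_demo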

Because the span is treated as variable, before vectorisation we have:

  <bb 3> [local count: 105119324]:
  _1 = *a_14(D).data;
  _2 = *a_14(D).offset;
  _5 = *a_14(D).span;

  <bb 4> [local count: 955630224]:
  # i_19 = PHI <i_16(6), 1(3)>
  _3 = (integer(kind=8)) i_19;
  _4 = _2 + _3;
  _6 = _4 * _5;
  _7 = (sizetype) _6;
  _8 = _1 + _7;
  MEM[(real(kind=8) *)_8] = 1.0e+0;
  i_16 = i_19 + 1;
  if (_13 < i_16)
    goto <bb 7>; [11.00%]
  else
    goto <bb 6>; [89.00%]

  <bb 6> [local count: 850510900]:
  goto <bb 4>; [100.00%]

and so we analyse the access as strided rather than contiguous:

analyze_innermost: success.
        base_address: _1 + (sizetype) ((_2 + 1) * _5)
        offset from base address: 0
        constant offset from base address: 0
---->   step: (ssizetype) _5
        base alignment: 8
        base misalignment: 0
        offset alignment: 128
        step alignment: 1
        base_object: MEM[(real(kind=8) *)_1 + (sizetype) ((_2 + 1) * _5)]
        Access function 0: {0B, +, (sizetype) _5}_1

The result is that the step is only known at run time, so for aarch64 we
generate a scatter store rather than a contiguous store:

foo_:
.LFB0:
        .cfi_startproc
        ldr     w2, [x1]
        cmp     w2, 0
        ble     .L1
        ldp     x5, x1, [x0, 32]
        whilelo p7.d, wzr, w2
        fmov    z30.d, #1.0e+0
        cntd    x3
        mul     x6, x1, x5
        index   z31.d, #0, x6
        mul     x4, x6, x3
        ldp     x0, x6, [x0]
        add     x1, x1, x6
        madd    x1, x1, x5, x0
        mov     x0, 0
        .p2align 5,,15
.L3:
        st1d    z30.d, p7, [x1, z31.d]
        add     x0, x0, x3
        add     x1, x1, x4
        whilelo p7.d, w0, w2
        b.any   .L3
.L1:
        ret

This is in contrast to:

foo_:
.LFB0:
        .cfi_startproc
        ldr     w2, [x1]
        cmp     w2, 0
        ble     .L1
        mov     x1, 0
        cntd    x3
        whilelo p7.d, wzr, w2
        fmov    z31.d, #1.0e+0
        .p2align 5,,15
.L3:
        st1d    z31.d, p7, [x0, x1, lsl 3]
        add     x1, x1, x3
        whilelo p7.d, w1, w2
        b.any   .L3
.L1:
        ret

for an array that is known to be fully contiguous.
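
[For reference, the second listing is roughly what one gets from an
explicit-shape variant along the following lines; this is a reconstruction
of the comparison source, not code from the original report.]

subroutine foo(a, n)
  integer :: i, n
  ! Explicit-shape dummy argument: no descriptor indirection, and the
  ! step between consecutive elements is the compile-time constant 8.
  real(kind=8) :: a(n)

  do i = 1, n
    a(i) = 1.0
  end do
end subroutine foo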
