https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118471
Bug ID: 118471
Summary: Missed folding of descriptor span field for contiguous
Fortran pointers
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---
[I'm filing this speculatively because I don't know whether the optimisation is
valid. Sorry in advance if it's not.]
For:
subroutine foo(a, n)
  real(kind=8), pointer, contiguous :: a(:)
  integer :: i, n
  do i = 1, n
     a(i) = 1.0
  end do
end subroutine foo
the contiguous pointer means that we assume that the stride is 1 in:
a.data + ((i * a.stride + a.offset) * a.span)
But we still treat the span as variable.  Are there any cases in which it can
be something other than 8 (the size of the real)?
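(FWIW, the only case I can think of in which the span differs from the element
size is a pointer associated with a component of a derived-type array, along
the lines of the hypothetical sketch below.  But such a target isn't
contiguous unless the component is the only one in the type, so presumably it
couldn't validly be assigned to a contiguous pointer anyway.)

```fortran
subroutine bar
  type t
     real(kind=8) :: x
     real(kind=8) :: y
  end type t
  type(t), target :: arr(10)
  real(kind=8), pointer :: p(:)
  ! Here the descriptor span would be the size of t (16 bytes) rather
  ! than the element size (8 bytes) -- but arr%x is not a contiguous
  ! target, so p could not be declared contiguous.
  p => arr%x
end subroutine bar
```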
This means that, before vectorisation, we have:
  <bb 3> [local count: 105119324]:
  _1 = *a_14(D).data;
  _2 = *a_14(D).offset;
  _5 = *a_14(D).span;

  <bb 4> [local count: 955630224]:
  # i_19 = PHI <i_16(6), 1(3)>
  _3 = (integer(kind=8)) i_19;
  _4 = _2 + _3;
  _6 = _4 * _5;
  _7 = (sizetype) _6;
  _8 = _1 + _7;
  MEM[(real(kind=8) *)_8] = 1.0e+0;
  i_16 = i_19 + 1;
  if (_13 < i_16)
    goto <bb 7>; [11.00%]
  else
    goto <bb 6>; [89.00%]

  <bb 6> [local count: 850510900]:
  goto <bb 4>; [100.00%]
and so we analyse the access as strided rather than contiguous:
analyze_innermost: success.
        base_address: _1 + (sizetype) ((_2 + 1) * _5)
        offset from base address: 0
        constant offset from base address: 0
---->   step: (ssizetype) _5
        base alignment: 8
        base misalignment: 0
        offset alignment: 128
        step alignment: 1
        base_object: MEM[(real(kind=8) *)_1 + (sizetype) ((_2 + 1) * _5)]
Access function 0: {0B, +, (sizetype) _5}_1
The result is that for aarch64 we generate a scatter store rather than a
contiguous store:
foo_:
.LFB0:
        .cfi_startproc
        ldr     w2, [x1]
        cmp     w2, 0
        ble     .L1
        ldp     x5, x1, [x0, 32]
        whilelo p7.d, wzr, w2
        fmov    z30.d, #1.0e+0
        cntd    x3
        mul     x6, x1, x5
        index   z31.d, #0, x6
        mul     x4, x6, x3
        ldp     x0, x6, [x0]
        add     x1, x1, x6
        madd    x1, x1, x5, x0
        mov     x0, 0
        .p2align 5,,15
.L3:
        st1d    z30.d, p7, [x1, z31.d]
        add     x0, x0, x3
        add     x1, x1, x4
        whilelo p7.d, w0, w2
        b.any   .L3
.L1:
        ret
This is in contrast to:
foo_:
.LFB0:
        .cfi_startproc
        ldr     w2, [x1]
        cmp     w2, 0
        ble     .L1
        mov     x1, 0
        cntd    x3
        whilelo p7.d, wzr, w2
        fmov    z31.d, #1.0e+0
        .p2align 5,,15
.L3:
        st1d    z31.d, p7, [x0, x1, lsl 3]
        add     x1, x1, x3
        whilelo p7.d, w1, w2
        b.any   .L3
.L1:
        ret
for an array that is known to be fully contiguous.