https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118471
Bug ID: 118471
Summary: Missed folding of descriptor span field for contiguous Fortran pointers
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: enhancement
Priority: P3
Component: fortran
Assignee: unassigned at gcc dot gnu.org
Reporter: rsandifo at gcc dot gnu.org
Target Milestone: ---

[I'm filing this speculatively because I don't know whether the optimisation is valid. Sorry in advance if it's not.]

For:

subroutine foo(a, n)
  real(kind=8), pointer, contiguous :: a(:)
  integer :: i, n
  do i = 1, n
    a(i) = 1.0
  end do
end subroutine foo

the contiguous pointer means that we assume that the stride is 1 in:

  a.data + ((i * a.stride + a.offset) * a.span)

But we still treat the span as variable. Are there any cases in which it can't be 8 (the size of the real)?

This means that, before vectorisation, we have:

  <bb 3> [local count: 105119324]:
  _1 = *a_14(D).data;
  _2 = *a_14(D).offset;
  _5 = *a_14(D).span;

  <bb 4> [local count: 955630224]:
  # i_19 = PHI <i_16(6), 1(3)>
  _3 = (integer(kind=8)) i_19;
  _4 = _2 + _3;
  _6 = _4 * _5;
  _7 = (sizetype) _6;
  _8 = _1 + _7;
  MEM[(real(kind=8) *)_8] = 1.0e+0;
  i_16 = i_19 + 1;
  if (_13 < i_16)
    goto <bb 7>; [11.00%]
  else
    goto <bb 6>; [89.00%]

  <bb 6> [local count: 850510900]:
  goto <bb 4>; [100.00%]

and so we analyse the access as strided rather than contiguous:

analyze_innermost: success.
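For reference, here is a minimal C model of the address computation quoted above. The struct is a simplified, hypothetical sketch of gfortran's rank-1 array descriptor (field names taken from the dump; the real layout differs), showing that once the stride is folded to 1 and the span to the element size, the addressing becomes a plain contiguous indexing:

```c
#include <stddef.h>
#include <assert.h>

/* Hypothetical, simplified model of gfortran's descriptor for a rank-1
   pointer array.  Field names follow the GIMPLE dump; the actual
   libgfortran layout is more involved.  */
struct descriptor {
    char     *data;    /* base pointer                              */
    ptrdiff_t offset;  /* element offset folded into the base       */
    ptrdiff_t stride;  /* element stride; 1 for a contiguous array  */
    ptrdiff_t span;    /* distance in bytes between elements        */
};

/* The address computation from the report:
   a.data + (i * a.stride + a.offset) * a.span  */
static double *element(struct descriptor *a, ptrdiff_t i)
{
    return (double *)(a->data + (i * a->stride + a->offset) * a->span);
}
```

With stride == 1 and span == sizeof(double), element(&a, i) reduces to &((double *)a.data)[i + a.offset], i.e. a contiguous access; the question in the report is whether span can ever legitimately differ from 8 here.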
	base_address: _1 + (sizetype) ((_2 + 1) * _5)
	offset from base address: 0
	constant offset from base address: 0
---->	step: (ssizetype) _5
	base alignment: 8
	base misalignment: 0
	offset alignment: 128
	step alignment: 1
	base_object: MEM[(real(kind=8) *)_1 + (sizetype) ((_2 + 1) * _5)]
Access function 0: {0B, +, (sizetype) _5}_1

The result is that for aarch64 we generate a scatter store rather than a contiguous store:

foo_:
.LFB0:
	.cfi_startproc
	ldr	w2, [x1]
	cmp	w2, 0
	ble	.L1
	ldp	x5, x1, [x0, 32]
	whilelo	p7.d, wzr, w2
	fmov	z30.d, #1.0e+0
	cntd	x3
	mul	x6, x1, x5
	index	z31.d, #0, x6
	mul	x4, x6, x3
	ldp	x0, x6, [x0]
	add	x1, x1, x6
	madd	x1, x1, x5, x0
	mov	x0, 0
	.p2align 5,,15
.L3:
	st1d	z30.d, p7, [x1, z31.d]
	add	x0, x0, x3
	add	x1, x1, x4
	whilelo	p7.d, w0, w2
	b.any	.L3
.L1:
	ret

This is in contrast to:

foo_:
.LFB0:
	.cfi_startproc
	ldr	w2, [x1]
	cmp	w2, 0
	ble	.L1
	mov	x1, 0
	cntd	x3
	whilelo	p7.d, wzr, w2
	fmov	z31.d, #1.0e+0
	.p2align 5,,15
.L3:
	st1d	z31.d, p7, [x0, x1, lsl 3]
	add	x1, x1, x3
	whilelo	p7.d, w1, w2
	b.any	.L3
.L1:
	ret

for an array that is known to be fully contiguous.
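The difference can be reduced to the following C sketch (my own analogue, not the compiler's reduction): when the step between stores is a runtime value, the vectoriser must treat the access as strided and emit a scatter, as in the first listing; when the step is the constant sizeof(double), it can emit the contiguous masked store of the second listing:

```c
#include <stddef.h>
#include <assert.h>

/* Step between elements is the runtime value 'span': the access looks
   strided, forcing a scatter store (first aarch64 listing).  */
void fill_variable_span(char *data, ptrdiff_t span, int n)
{
    for (int i = 0; i < n; i++)
        *(double *)(data + (ptrdiff_t)i * span) = 1.0;
}

/* Step is the constant sizeof(double): a plain contiguous store,
   vectorised with st1d [x0, x1, lsl 3] (second listing).  */
void fill_constant_span(double *data, int n)
{
    for (int i = 0; i < n; i++)
        data[i] = 1.0;
}
```

Both loops store the same values when span == sizeof(double); folding the descriptor's span field to 8 for a contiguous real(kind=8) pointer would let the vectoriser prove that and pick the contiguous form.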