https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125523
Bug ID: 125523
Summary: [AArch64] GCC fails to SLP vectorize 4-int horizontal
reduction when mixed with other scalar operations in
same loop
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
The test case contains a loop that accumulates values from two different struct
arrays:
- One `long` value from `a[idx*3].m0`
- Four `int` values from `b[idx*3].m0` through `b[idx*3].m3`
The four `int` members of `b` are consecutive in memory (16 bytes total) and
are a natural fit for SIMD horizontal reduction: load all four with a single
128-bit load, then sum them with a single SIMD instruction.
**Clang** successfully vectorizes this pattern:
- Loads all four `int` members with `ldr q0`
- Uses `saddlv d0, v0.4s` to sum them horizontally in one instruction
- Handles the `a` member as a separate scalar operation
**GCC** fails to vectorize the four `int` reduction:
- Loads `b.m0` with `ldrsw` (scalar)
- Uses `ldpsw` to load `b.m1` and `b.m2` together (memory optimization only)
- Loads `b.m3` separately
- Performs scalar additions for each member
The presence of `a.m0` (a scalar operation from a different struct) appears to
prevent GCC's SLP vectorizer from recognizing and packing the four consecutive
`int` members of `b`.
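For reference, the reduction described above can be written by hand. The following is only a sketch: `sum_b_members` is a hypothetical helper that does not appear in the test case, and it assumes the ACLE intrinsics `vld1q_s32` and `vaddlvq_s32` from `<arm_neon.h>` (the latter corresponds to `saddlv`). A scalar fallback is included so the snippet also builds on non-AArch64 hosts.

```c
#include <stdint.h>
#if defined(__aarch64__)
#include <arm_neon.h>
#endif

typedef struct {
    int m0;
    int m1;
    int m2;
    int m3;
} element_t_1;

/* Hypothetical helper, not part of the reported test case:
   widening sum of the four consecutive int members of *p. */
static inline int64_t sum_b_members(const element_t_1 *p) {
#if defined(__aarch64__)
    /* One 128-bit load covering m0..m3 (assumes no padding in the struct). */
    int32x4_t v = vld1q_s32((const int32_t *)p);
    /* Widening add across all four lanes; expected to map to saddlv. */
    return vaddlvq_s32(v);
#else
    /* Portable scalar equivalent. */
    return (int64_t)p->m0 + p->m1 + p->m2 + p->m3;
#endif
}
```

This is what the loop body computes for `b` on each iteration; the report is about GCC not deriving this form automatically, not about the intrinsics themselves.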
**Test case:**
```c
#include <stdint.h>
#include <stddef.h>
typedef struct {
    long m0;
} element_t_0;
typedef struct {
    int m0;
    int m1;
    int m2;
    int m3;
} element_t_1;
double foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    int n, int m, int l) {
    long sum = 0;
    for (int i = n - 1; i >= 0; i -= 1) {
        for (int j = m - 1; j >= 0; j -= 4) {
            for (int k = l - 1; k >= 0; k -= 1) {
                int idx = (i * m + j) * l + k;
                sum += (long)a[(idx * 3)].m0;
                sum += (long)b[(idx * 3)].m0;
                sum += (long)b[(idx * 3)].m1;
                sum += (long)b[(idx * 3)].m2;
                sum += (long)b[(idx * 3)].m3;
                if ((b[idx].m0 < 0)) {
                    break;
                }
            }
        }
    }
    return (double)sum;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental)
```
**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all -fno-trapping-math
-fvect-cost-model=unlimited
```
**GCC assembly (key loop portion):**
```assembly
.L10:
ldr x7, [x9] ; a.m0 (scalar)
ldrsw x6, [x2] ; b.m0 (scalar)
add x3, x3, x7 ; add a.m0
add x3, x6, x3 ; add b.m0
ldpsw x7, x6, [x2, 4] ; b.m1 and b.m2 (load pair only, not SIMD)
add x7, x7, x3 ; add b.m1
add x6, x6, x7 ; add b.m2
ldrsw x3, [x2, 60] ; b.m3 (scalar)
add x3, x3, x6 ; add b.m3
```
Note: GCC uses `ldpsw` to load two 32-bit values together, but this is a memory
pairing optimization, not SIMD vectorization. Each value is still added
separately.
Also reproducible on Godbolt:
https://godbolt.org/z/6Yh1jqc8a
**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```
**Clang compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -Rpass=.*vectorize.* -Rpass-missed=.*vectorize.*
-Rpass-analysis=.*vectorize.*
```
**Clang assembly (key vectorized portion):**
```assembly
.LBB0_7:
ldr q0, [x19], #-48 ; load b.m0,b.m1,b.m2,b.m3 into SIMD register
ldr x22, [x20], #-24 ; load a.m0 (scalar)
saddlv d0, v0.4s ; horizontal sum of all 4 ints
fmov x23, d0
add x9, x23, x9 ; add b sum to total
add x9, x9, x22 ; add a.m0 to total
```
Also reproducible on Godbolt:
https://godbolt.org/z/5x3cnEox5
**Additional notes:**
1. **The missed optimization:** Four consecutive `int` members (`m0` through
`m3`) of `b` occupy 16 contiguous bytes and can be loaded with a single 128-bit
SIMD load, then summed horizontally with `saddlv`. In the listings above, this
replaces GCC's three scalar loads and four scalar adds for `b` with one vector
load, one `saddlv`, one `fmov`, and one scalar add.
2. **When GCC succeeds:** GCC successfully vectorizes this pattern when
**only** the four `b` members are present (no `a.m0` in the loop body).
3. **When GCC fails:** Adding a single scalar operation from a different struct
(`a.m0`) prevents GCC from recognizing the SIMD opportunity on `b`'s members.
GCC only uses `ldpsw` (load pair) as a memory optimization, not full SIMD.
4. **Clang handles mixed sources correctly:** Clang independently vectorizes
the four `b` members while keeping `a.m0` as a scalar operation.
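
The `b`-only variant mentioned in note 2 is not attached to this report. A guess at its shape, derived by deleting the `a` parameter and the `a[(idx * 3)].m0` line from the test case, might look like the sketch below. The name `foo_b_only` is hypothetical, and whether this exact form is the one GCC vectorizes is an assumption that should be checked against the compiler output.

```c
typedef struct {
    int m0;
    int m1;
    int m2;
    int m3;
} element_t_1;

/* Hypothetical variant: same loop nest as foo(), minus the a[] access. */
double foo_b_only(const element_t_1 * __restrict__ b, int n, int m, int l) {
    long sum = 0;
    for (int i = n - 1; i >= 0; i -= 1) {
        for (int j = m - 1; j >= 0; j -= 4) {
            for (int k = l - 1; k >= 0; k -= 1) {
                int idx = (i * m + j) * l + k;
                /* Only the four consecutive int members are accumulated. */
                sum += (long)b[(idx * 3)].m0;
                sum += (long)b[(idx * 3)].m1;
                sum += (long)b[(idx * 3)].m2;
                sum += (long)b[(idx * 3)].m3;
                /* Early exit kept identical to the original test case. */
                if ((b[idx].m0 < 0)) {
                    break;
                }
            }
        }
    }
    return (double)sum;
}
```

Compiling this next to the original `foo` with the same options and comparing the `-fopt-info-vec-all` output should show at which point the SLP analysis diverges between the two.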