https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125247
Bug ID: 125247
Summary: SVE vectorization fails for loop with distance -8
dependence and mixed-type struct — legality analysis
bailout, not cost-model
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
GCC trunk fails to vectorize a loop containing an inter-iteration dependence of
distance 8 when the loop body involves type conversions from a mixed-type
struct (short, short, long) to float on AArch64 with SVE. The vectorizer bails
out during **legality/dependence analysis**, before reaching the cost model,
emitting `Analysis failed with vector mode VNx8HI` and `Skipping vector mode
VNx16QI, which would repeat the analysis for VNx8HI`. Even with
`-fvect-cost-model=unlimited`, no vector code is generated.
In contrast, Clang 22.1.4 vectorizes the same loop with NEON (width 4) when
vectorization is requested via `#pragma clang loop vectorize(enable)`. This
indicates that the loop is legal to vectorize at that width and that the
problem lies in GCC's legality analysis phase, not in a cost-model rejection.
**Test case:**
```c
typedef struct {
    short m0;
    short m1;
    long m2;
} element_t_0;
float foo(const element_t_0 *__restrict__ a,
          float *__restrict__ out,
          int n)
{
    for (int i = 0; i < n; i += 1) {
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f)
               + ((float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2);
    }
    return 0.0f;
}
```
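To make the layout of the test-case struct concrete, here is a minimal sketch that reports the field offsets and the padding in front of `m2`. The helper names `padding_before_m2` and `print_layout` are hypothetical and not part of the reproducer. On an LP64 target such as AArch64 Linux one would expect `m2` at offset 8 and a 16-byte element, but the exact numbers depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Same struct as in the test case. */
typedef struct {
    short m0;
    short m1;
    long m2;
} element_t_0;

/* Bytes of alignment padding between the end of m1 and the start of m2
   (hypothetical helper, for illustration only). */
size_t padding_before_m2(void)
{
    return offsetof(element_t_0, m2)
         - (offsetof(element_t_0, m1) + sizeof(short));
}

/* Print the layout chosen by the current ABI (hypothetical helper). */
void print_layout(void)
{
    /* The struct size is always a multiple of the alignment of `long`. */
    assert(sizeof(element_t_0) % _Alignof(long) == 0);
    printf("m0 at %zu, m1 at %zu, m2 at %zu, padding %zu, sizeof %zu\n",
           offsetof(element_t_0, m0), offsetof(element_t_0, m1),
           offsetof(element_t_0, m2), padding_before_m2(),
           sizeof(element_t_0));
}
```

Calling `print_layout()` from a small `main` shows why the three fields cannot be fetched with one contiguous vector load per field: each field is strided by the element size, and the fields have different widths.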
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```
**Compilation options:**
```
-O3 -march=armv9-a+sve -ftree-vectorize -fopt-info-vec-all
-fvect-cost-model=unlimited -fno-trapping-math
```
**GCC trunk output:**
```
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: unsupported SLP instances
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: unsupported SLP instances
<source>:8:7: note: vectorized 0 loops in function.
<source>:18:12: note: ***** Analysis failed with vector mode VNx8HI
<source>:18:12: note: ***** Skipping vector mode VNx16QI, which would repeat the analysis for VNx8HI
```
Generated assembly (scalar only — full loop body remains unvectorized):
```assembly
foo:
        cmp w2, 0
        ble .L2
        ...
.L3:
        ldrsh w6, [x0, 2]
        ldrsh w5, [x0, -16]
        ldr x4, [x0, -8]
        scvtf s1, w6
        scvtf s0, w5
        ldr s30, [x7, x3, lsl 2]
        scvtf s31, x4
        fadd s0, s1, s0
        fadd s31, s0, s31
        fadd s30, s31, s30
        str s30, [x1, x3, lsl 2]
        add x3, x3, 1
        cmp w2, w3
        bgt .L3
.L2:
        movi v0.2s, #0
        ret
```
Also reproducible on Godbolt: https://godbolt.org/z/nGrr6PnMd
**Clang 22.1.4 with `#pragma clang loop vectorize(enable)` (for comparison):**
Clang output:
```
<source>:12:5: remark: Max legal vector width too small, scalable vectorization unfeasible. [-Rpass-analysis]
<source>:12:5: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
<source>:12:5: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]
<source>:15:65: remark: List vectorization was possible but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
```
Key vectorized portion (NEON, width 4, `#pragma` overrides cost-model):
```assembly
.LBB0_4:
        ldp q7, q6, [x10]
        ldr s4, [x10, #40]
        ldr d5, [x10, #24]
        ldur d16, [x10, #-8]
        zip1 v4.4h, v5.4h, v4.4h
        zip1 v6.2d, v7.2d, v6.2d
        ldr d7, [x10, #8]
        zip1 v17.4h, v16.4h, v7.4h
        trn2 v7.4h, v16.4h, v7.4h
        ...
        sshll v7.4s, v7.4h, #0
        scvtf v7.4s, v7.4s
        sshll v4.4s, v17.4h, #0
        scvtf v4.4s, v4.4s
        ...
        fadd v4.4s, v4.4s, v7.4s
        fadd v4.4s, v4.4s, v6.4s
        ld1w { z5.s }, p1/z, [x12, x11, lsl #2]
        fadd v4.4s, v5.4s, v4.4s
        str q4, [x12], #16
        b.ne .LBB0_4
```
Also reproducible on Godbolt: https://godbolt.org/z/aMTKEsjvj
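One possible reading of Clang's `Max legal vector width too small, scalable vectorization unfeasible` remark is sketched below. This is an interpretation, not something the compiler output states. It assumes that a full vector of 32-bit floats must not span more elements than the dependence distance, and uses the architectural fact that SVE vector lengths are multiples of 128 bits, from 128 up to 2048 bits. All function names are hypothetical.

```c
#include <assert.h>

/* Number of 32-bit float lanes in a vector of vl_bits bits. */
int float_lanes(int vl_bits)
{
    return vl_bits / 32;
}

/* A read-after-write dependence of `distance` iterations is preserved
   when one vector iteration covers at most `distance` elements. */
int vf_is_safe(int lanes, int distance)
{
    assert(lanes > 0 && distance > 0);
    return lanes <= distance;
}

/* Largest SVE vector length in bits whose full float vector respects
   the given dependence distance, or 0 if even 128 bits is too wide. */
int max_safe_sve_vl_bits(int distance)
{
    int best = 0;
    for (int vl = 128; vl <= 2048; vl += 128) {
        if (vf_is_safe(float_lanes(vl), distance))
            best = vl;
    }
    return best;
}
```

Under these assumptions a distance of 8 is safe for 128-bit NEON (4 lanes) and for SVE implementations of up to 256 bits, but not for every vector length that length-agnostic SVE code may run on. That would be consistent with Clang choosing fixed-width NEON here.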
**Additional notes:**
- The loop has a read-after-write dependence of distance 8 (`out[i]` reads
`out[i-8]`, written eight iterations earlier), so any vectorization factor of
up to 8 elements preserves it. Both the distance and direction are compile-time
constants.
- The struct layout `{short, short, long}` contains alignment padding and
requires gather-like operations plus integer-to-float conversions (scvtf) to
pack into float vectors.
- Clang, when asked via `#pragma clang loop vectorize(enable)`, vectorizes the
loop with NEON (width 4). This **confirms the loop is semantically legal to
vectorize** at that width: a dependence distance of 8 is not violated by
4-element vectors. The pragma overrides any cost-model hesitation, and Clang's
legality analysis passes.
- By contrast, GCC's failure appears to occur **before reaching the cost
model**, as evidenced by:
  - `Analysis failed with vector mode VNx8HI`, which reports a failed
    vectorizer analysis, not a cost-model rejection. The only reason logged
    is `unsupported SLP instances`.
  - `-fvect-cost-model=unlimited` has no effect on the outcome.
  - The output shows no attempt at fixed-width NEON vectorization, despite
    NEON being available under `-march=armv9-a+sve`.
- The bailout at `VNx8HI` is followed by an explicit skip of `VNx16QI` with the
message "would repeat the analysis", meaning that mode would lead to the same
analysis result. The failure is therefore not specific to one vector mode and
may point to a problem in how the vectorizer handles mixed-type struct fields
for SVE.
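
The legality argument in the notes above can be checked with a small emulation. The sketch below is not compiler output and the names `foo_scalar`, `foo_blocked` and `blocked_matches_scalar` are hypothetical. It models a vector loop with `vf` lanes by performing all loads of a block before any store of that block, then compares the result with the scalar loop from the test case.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    short m0;
    short m1;
    long m2;
} element_t_0;

#define MAX_VF 64

/* Scalar reference: same loop body as the test case. */
void foo_scalar(const element_t_0 *a, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f)
               + ((float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2);
}

/* Emulated vector loop: each block of up to vf elements first reads all
   of its out[i - 8] inputs, then computes and stores all of its results. */
void foo_blocked(const element_t_0 *a, float *out, int n, int vf)
{
    float prev[MAX_VF];
    assert(vf >= 1 && vf <= MAX_VF);
    for (int i = 0; i < n; i += vf) {
        int lanes = (n - i < vf) ? n - i : vf;
        for (int l = 0; l < lanes; l++)   /* "vector load" */
            prev[l] = (i + l >= 8) ? out[i + l - 8] : 0.0f;
        for (int l = 0; l < lanes; l++)   /* "vector add and store" */
            out[i + l] = prev[l]
                       + ((float)a[i + l].m0 + (float)a[i + l].m1
                          + (float)a[i + l].m2);
    }
}

/* Returns 1 if the emulated vector loop reproduces the scalar result. */
int blocked_matches_scalar(int vf)
{
    enum { N = 100 };
    element_t_0 a[N];
    float ref[N], got[N];
    for (int i = 0; i < N; i++) {
        a[i].m0 = (short)(i + 1);
        a[i].m1 = (short)(2 * i);
        a[i].m2 = 3L * i - 7;
    }
    memset(ref, 0, sizeof ref);
    memset(got, 0, sizeof got);
    foo_scalar(a, ref, N);
    foo_blocked(a, got, N, vf);
    return memcmp(ref, got, sizeof ref) == 0;
}
```

With this input, `blocked_matches_scalar` holds for `vf` values of 1, 4 and 8 and fails for 9 and 16, because a block wider than 8 elements reads `out[i - 8]` before the same block has written it. Widths up to 8, including the NEON width of 4 used by Clang, are therefore legal for this loop.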