https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125520
Bug ID: 125520
Summary: [ARM64] Failed to SLP vectorize adjacent int-to-float
conversion inside if-statement in loop with early exit
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: bug_hunters at yeah dot net
Target Milestone: ---
**Description:**
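The test case involves a loop that loads two adjacent `int32_t` members (`m0`,
`m1`) from a struct array (`element_t_0`), converts them to `float`, adds them
together, and then adds a third value: an `int32_t` loaded from another struct
array (`element_t_1`) and likewise converted to `float`. The computation sits
inside an `if` in an inner loop that also has an early exit (`break`).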
GCC currently:
1. Reports `unsupported control flow in loop` for both loops and vectorizes none
2. Uses `ldp` to load both integers efficiently (good)
3. Performs **separate scalar** `scvtf` conversions for each integer (not
vectorized)
4. Uses scalar `fadd` for the addition (not vectorized)
Clang successfully vectorizes this pattern by:
1. Loading both integers into a 64-bit register (`ldr d0`)
2. Using SIMD conversion (`scvtf v0.2s`) to convert both to `float`
simultaneously
3. Using SIMD horizontal addition (`faddp`) to sum the two floats
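The Clang sequence above can be mirrored in portable C with GNU vector extensions. The following is a minimal sketch of the desired shape, not a description of what either compiler does internally. The names `sum_pair_as_float`, `v2si`, and `v2sf` are hypothetical and introduced here only for illustration, and `memcpy` stands in for the single 64-bit load.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* GNU C vector extension types (accepted by GCC and Clang). */
typedef int32_t v2si __attribute__((vector_size(8)));
typedef float   v2sf __attribute__((vector_size(8)));

/* Hypothetical helper: two-lane form of (float)e->m0 + (float)e->m1. */
float sum_pair_as_float(const element_t_0 *e)
{
    v2si ints;
    memcpy(&ints, e, sizeof ints);                    /* one 64-bit load of m0 and m1 */
    v2sf flts = __builtin_convertvector(ints, v2sf);  /* convert both lanes together */
    return flts[0] + flts[1];                         /* horizontal add of the two lanes */
}
```

For example, an element with `m0 = 3` and `m1 = 4` yields `7.0f`.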
Disabling the cost model (`-fno-vect-cost-model`) does **not** enable
vectorization in GCC, indicating the issue is not cost-related but rather a
missing capability in GCC's SLP vectorizer for this specific pattern.
**Test case:**
```c
#include <stdint.h>
typedef struct {
int32_t m0;
int32_t m1;
} element_t_0;
typedef struct {
int32_t m0;
} element_t_1;
float foo(
const element_t_0 * __restrict__ a,
const element_t_1 * __restrict__ b,
float * __restrict__ out,
int n, int m) {
for (int j = 0; j < m; j += 2) {
for (int i = 0; i < n; i += 1) {
int idx = j * m + i;
if ((i + 1) % 3 == 0) {
out[idx] = (((float)a[(idx + 15)].m0) +
((float)a[(idx + 15)].m1) +
((float)b[(idx + 15)].m0));
}
if ((a[idx].m0 < -1)) {
break;
}
}
}
return (float)0;
}
```
**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```
**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
```
**GCC output (vectorizer report):**
```
<source>:17:23: missed: couldn't vectorize loop
<source>:17:23: missed: not vectorized: unsupported control flow in loop.
<source>:19:27: missed: couldn't vectorize loop
<source>:19:27: missed: not vectorized: unsupported control flow in loop.
<source>:11:7: note: vectorized 0 loops in function.
```
**GCC output (key loop portion):**
```assembly
.L8:
ldp s0, s30, [x6, 120]
add x6, x6, 8
ldr s31, [x11, x5, lsl 2]
ldr w8, [x6, -8]
scvtf s0, s0
scvtf s30, s30
scvtf s31, s31
fadd s30, s0, s30
fadd s31, s30, s31
str s31, [x9, x5, lsl 2]
```
Also reproducible on Godbolt:
https://godbolt.org/z/Pcej7c66f
**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```
**Compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -ftree-slp-vectorize -Rpass=.*vectorize.*
-Rpass-missed=.*vectorize.* -Rpass-analysis=.*vectorize.*
```
**Clang output (key vectorized portion):**
```assembly
ldr d0, [x18, #120]
scvtf v0.2s, v0.2s
faddp s0, v0.2s
ldr s1, [x13, x15, lsl #2]
scvtf s1, s1
fadd s0, s0, s1
```
Also reproducible on Godbolt:
https://godbolt.org/z/v5vPc7bo7
**Additional notes:**
**Actual behavior:** GCC only optimizes the memory access (`ldp`) but keeps
the arithmetic scalar.
**Cost model is not the issue:** Adding `-fno-vect-cost-model` does not change
the generated code, confirming the limitation is in the vectorization
capability itself.
**Pattern summary:** The missed optimization is:
- Two adjacent 32-bit integer loads from the same struct
- Conversion to float (both)
- Addition of the two converted values (horizontal operation)
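The pattern can also be isolated from the surrounding loop for triage. The sketch below is a hypothetical reduction (`pair_t` and `sum_pair` are names introduced here, and the third operand is passed in directly rather than loaded from `element_t_1`). Whether GCC SLP-vectorizes this reduced form was not tested in this report.

```c
#include <stdint.h>

/* Mirrors element_t_0 from the test case. */
typedef struct {
    int32_t m0;
    int32_t m1;
} pair_t;

/* Hypothetical reduced kernel: no loop, no condition, no early exit. */
float sum_pair(const pair_t *p, int32_t c)
{
    float f0 = (float)p->m0;      /* first adjacent 32-bit load, converted */
    float f1 = (float)p->m1;      /* second adjacent 32-bit load, converted */
    return (f0 + f1) + (float)c;  /* add the pair, then the third operand */
}
```

If this form is vectorized while the original test case is not, that would point at the conditional store or the early exit rather than at the conversion pattern itself.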
**Floating-point trapping is not the issue:** Adding `-fno-trapping-math` (which
lets the compiler assume that floating-point operations cannot raise
user-visible traps) also does not enable vectorization. This rules out
concerns about `scvtf` or `fadd` raising floating-point exceptions as the
blocking factor.