[Bug target/125520] New: [ARM64] Failed to SLP vectorize adjacent int-to-float conversion inside if-statement in loop with early exit

bug_hunters at yeah dot net via Gcc-bugs Sun, 31 May 2026 07:47:15 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125520


            Bug ID: 125520
           Summary: [ARM64] Failed to SLP vectorize adjacent int-to-float
                    conversion inside if-statement in loop with early exit
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

The test case involves a loop that loads two adjacent `int32_t` members (`m0`,
`m1`) from a struct array (`element_t_0`), converts them to `float`, adds them
together, and then adds a third `int32_t` value (`m0`) loaded from another
struct array (`element_t_1`) and likewise converted to `float`.

GCC currently:
1. Reports `unsupported control flow in loop` for both loops and vectorizes
neither
2. Uses `ldp` to load both integers efficiently (good)
3. Performs **separate scalar** `scvtf` conversions for each integer (not
vectorized)
4. Uses scalar `fadd` for the addition (not vectorized)

Clang successfully vectorizes this pattern by:
1. Loading both integers into a 64-bit register (`ldr d0`)
2. Using SIMD conversion (`scvtf v0.2s`) to convert both to `float`
simultaneously
3. Using SIMD horizontal addition (`faddp`) to sum the two floats

Disabling the cost model (`-fno-vect-cost-model`) does **not** enable
vectorization in GCC, indicating the issue is not cost-related but rather a
missing capability in GCC's SLP vectorizer for this specific pattern.
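
For illustration, the three steps Clang performs can be written by hand with the GNU vector extension. This is a minimal sketch and not taken from either compiler's output: `convert_and_sum_pair`, `v2si`, and `v2sf` are hypothetical names, and whether a given compiler lowers it to `scvtf v0.2s` followed by `faddp` has not been verified here.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Same layout as element_t_0 in the test case. */
typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* Two-lane vector types via the GNU vector extension (GCC and Clang).
 * These type names are illustrative, not from the report. */
typedef int32_t v2si __attribute__((vector_size(8)));
typedef float   v2sf __attribute__((vector_size(8)));

static_assert(sizeof(element_t_0) == sizeof(v2si),
              "struct must be exactly two packed int32_t lanes");

/* Hypothetical helper: convert both members at once, then sum the lanes. */
static inline float convert_and_sum_pair(const element_t_0 *e)
{
    v2si ints;
    memcpy(&ints, e, sizeof ints);                      /* single 64-bit load */
    v2sf floats = __builtin_convertvector(ints, v2sf);  /* lane-wise int -> float */
    return floats[0] + floats[1];                       /* horizontal add */
}
```

In the test case, the expression `(float)a[idx + 15].m0 + (float)a[idx + 15].m1` would correspond to `convert_and_sum_pair(&a[idx + 15])`.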

**Test case:**
```c
#include <stdint.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

typedef struct {
    int32_t m0;
} element_t_1;

float foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    float * __restrict__ out,
    int n, int m) {
    for (int j = 0; j < m; j += 2) {
        for (int i = 0; i < n; i += 1) {
            int idx = j * m + i;
            if ((i + 1) % 3 == 0) {
                out[idx] = (((float)a[(idx + 15)].m0) +
                            ((float)a[(idx + 15)].m1) +
                            ((float)b[(idx + 15)].m0));
            }
            if ((a[idx].m0 < -1)) {
                break;
            }
        }
    }
    return (float)0;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
```

**GCC output:**
```
<source>:17:23: missed: couldn't vectorize loop
<source>:17:23: missed: not vectorized: unsupported control flow in loop.
<source>:19:27: missed: couldn't vectorize loop
<source>:19:27: missed: not vectorized: unsupported control flow in loop.
<source>:11:7: note: vectorized 0 loops in function.
```

**GCC output (key loop portion):**
```assembly
.L8:
        ldp     s0, s30, [x6, 120]
        add     x6, x6, 8
        ldr     s31, [x11, x5, lsl 2]
        ldr     w8, [x6, -8]
        scvtf   s0, s0
        scvtf   s30, s30
        scvtf   s31, s31
        fadd    s30, s0, s30
        fadd    s31, s30, s31
        str     s31, [x9, x5, lsl 2]
```

Also reproducible on Godbolt:
https://godbolt.org/z/Pcej7c66f
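
As a reading aid for the listing: the `120` displacement in `ldp s0, s30, [x6, 120]` appears to correspond to the `+ 15` element offset in the source, since each `element_t_0` occupies 8 bytes. The arithmetic can be checked at compile time with the sketch below; `ELEMENT_OFFSET` is an illustrative name, not part of the test case.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof */
#include <stdint.h>

/* Same layout as element_t_0 in the test case. */
typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* a[idx + 15] lies 15 * sizeof(element_t_0) bytes past a[idx]. */
enum { ELEMENT_OFFSET = 15 };

static_assert(sizeof(element_t_0) == 8,
              "two packed int32_t members");
static_assert(offsetof(element_t_0, m1) == 4,
              "m1 sits directly after m0, so one paired load covers both");
static_assert(ELEMENT_OFFSET * sizeof(element_t_0) == 120,
              "matches the displacement in the ldp instruction");
```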

**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```

**Compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -ftree-slp-vectorize -Rpass=.*vectorize.* 
-Rpass-missed=.*vectorize.*  -Rpass-analysis=.*vectorize.*
```

**Key vectorized portion:**
```assembly
        ldr     d0, [x18, #120]
        scvtf   v0.2s, v0.2s
        faddp   s0, v0.2s
        ldr     s1, [x13, x15, lsl #2]
        scvtf   s1, s1
        fadd    s0, s0, s1
```

Also reproducible on Godbolt:
https://godbolt.org/z/v5vPc7bo7

**Additional notes:**

**Actual behavior:** GCC only optimizes the memory access (`ldp`) but keeps
the arithmetic scalar.

**Cost model is not the issue:** Adding `-fno-vect-cost-model` does not change
the generated code, confirming the limitation is in the vectorization
capability itself.

**Pattern summary:** The missed optimization is:
   - Two adjacent 32-bit integer loads from the same struct
   - Conversion to float (both)
   - Addition of the two converted values (horizontal operation)
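
To help separate the SLP question from the loop's control flow, the pattern above can be isolated in a straight-line function. This is a hypothetical reduction and not part of the original report: `pair_sum` is an illustrative name, and whether GCC vectorizes this form has not been checked here.

```c
#include <stdint.h>

/* Same layout as element_t_0 in the test case. */
typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* Hypothetical reduced kernel: two adjacent int32_t loads, two
 * int-to-float conversions, one addition. No loop, no branch,
 * no early exit. */
float pair_sum(const element_t_0 *p)
{
    return (float)p->m0 + (float)p->m1;
}
```

Compiling this with the same options as the full test case would show whether the scalar `scvtf` pair persists without the surrounding `if` and `break`.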

**Floating-point trapping is not the issue:** Adding `-fno-trapping-math`
(which lets the compiler assume that floating-point operations cannot generate
user-visible traps) also does not enable vectorization. This rules out concerns
about `scvtf` or `fadd` raising floating-point exceptions as the blocking
factor.
