https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125520

            Bug ID: 125520
           Summary: [ARM64] Failed to SLP vectorize adjacent int-to-float
                    conversion inside if-statement in loop with early exit
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

The test case is a loop that loads two adjacent `int32_t` members (`m0`,
`m1`) from an array of `element_t_0` structs, converts each to `float`, adds
them together, and then adds a third value: an `int32_t` loaded from an array
of `element_t_1` structs and likewise converted to `float`.

GCC currently:
1. Reports `unsupported control flow in loop` and vectorizes neither loop.
2. Uses `ldp` to load both integers efficiently (good).
3. Performs a **separate scalar** `scvtf` conversion for each integer (not
vectorized).
4. Uses scalar `fadd` for the addition (not vectorized).

Clang successfully vectorizes this pattern by:
1. Loading both integers into a 64-bit register (`ldr d0`)
2. Using SIMD conversion (`scvtf v0.2s`) to convert both to `float`
simultaneously
3. Using SIMD horizontal addition (`faddp`) to sum the two floats
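
At the source level, the sequence Clang emits corresponds roughly to the
following sketch. It uses the GNU vector extensions (`vector_size`,
`__builtin_convertvector`) rather than NEON intrinsics so that it stays
target-independent. The helper name `sum_pair_vec` and the `v2si`/`v2sf`
typedefs are made up for illustration and are not part of the test case;
whether a given compiler maps this to `scvtf v0.2s` and `faddp` has not been
verified here.

```c
#include <stdint.h>
#include <string.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* Hypothetical 2-lane vector types (GNU vector extension). */
typedef int32_t v2si __attribute__((vector_size(8)));
typedef float   v2sf __attribute__((vector_size(8)));

/* Hypothetical helper: convert m0 and m1 together, then add the two lanes. */
static inline float sum_pair_vec(const element_t_0 *p) {
    v2si ints;
    memcpy(&ints, p, sizeof ints);                    /* one 64-bit load of m0 and m1 */
    v2sf flts = __builtin_convertvector(ints, v2sf);  /* both int -> float conversions */
    return flts[0] + flts[1];                         /* pairwise (horizontal) add */
}
```

The lane order matches the scalar expression `(float)m0 + (float)m1`, so the
result should be identical to the scalar code.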

Disabling the cost model (`-fno-vect-cost-model`) does **not** enable
vectorization in GCC, indicating the issue is not cost-related but rather a
missing capability in GCC's SLP vectorizer for this specific pattern.

**Test case:**
```c
#include <stdint.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

typedef struct {
    int32_t m0;
} element_t_1;

float foo(
    const element_t_0 * __restrict__ a,
    const element_t_1 * __restrict__ b,
    float * __restrict__ out,
    int n, int m) {
    for (int j = 0; j < m; j += 2) {
        for (int i = 0; i < n; i += 1) {
            int idx = j * m + i;
            if ((i + 1) % 3 == 0) {
                out[idx] = (((float)a[(idx + 15)].m0) +
                            ((float)a[(idx + 15)].m1) +
                            ((float)b[(idx + 15)].m0));
            }
            if ((a[idx].m0 < -1)) {
                break;
            }
        }
    }
    return (float)0;
}
```

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```

**Compilation options:**
```
-march=armv9-a+sve -ftree-vectorize -O3 -fopt-info-vec-all
```

**GCC diagnostics (`-fopt-info-vec-all`):**
```
<source>:17:23: missed: couldn't vectorize loop
<source>:17:23: missed: not vectorized: unsupported control flow in loop.
<source>:19:27: missed: couldn't vectorize loop
<source>:19:27: missed: not vectorized: unsupported control flow in loop.
<source>:11:7: note: vectorized 0 loops in function.
```

**GCC output (key loop portion):**
```assembly
.L8:
        ldp     s0, s30, [x6, 120]
        add     x6, x6, 8
        ldr     s31, [x11, x5, lsl 2]
        ldr     w8, [x6, -8]
        scvtf   s0, s0
        scvtf   s30, s30
        scvtf   s31, s31
        fadd    s30, s0, s30
        fadd    s31, s30, s31
        str     s31, [x9, x5, lsl 2]
```

Also reproducible on Godbolt:
https://godbolt.org/z/Pcej7c66f

**Clang version:**
```
clang version 22.1.4
Target: aarch64-unknown-linux-gnu
```

**Compilation options:**
```
-S -O3 -ftree-vectorize -ftree-slp-vectorize --target=aarch64-linux-gnu
-march=armv9-a+sve -ftree-slp-vectorize -Rpass=.*vectorize.* 
-Rpass-missed=.*vectorize.*  -Rpass-analysis=.*vectorize.*
```

**Key vectorized portion:**
```assembly
        ldr     d0, [x18, #120]
        scvtf   v0.2s, v0.2s
        faddp   s0, v0.2s
        ldr     s1, [x13, x15, lsl #2]
        scvtf   s1, s1
        fadd    s0, s0, s1
```

Also reproducible on Godbolt:
https://godbolt.org/z/v5vPc7bo7

**Additional notes:**

**Actual behavior:** GCC only optimizes the memory access (`ldp`) but keeps
the arithmetic scalar.

**Cost model is not the issue:** Adding `-fno-vect-cost-model` does not change
the generated code, confirming the limitation is in the vectorization
capability itself.

**Pattern summary:** The missed optimization is:
   - Two adjacent 32-bit integer loads from the same struct
   - Conversion to float (both)
   - Addition of the two converted values (horizontal operation)
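
Stripped of the surrounding loops, the `% 3` guard, and the early exit, the
pattern can be sketched as below. The function name `sum_pair` is hypothetical,
and this reduced form has not been checked against GCC trunk; it is only meant
to make the three steps listed above concrete.

```c
#include <stdint.h>

typedef struct {
    int32_t m0;
    int32_t m1;
} element_t_0;

/* Hypothetical reduced kernel: two adjacent loads, two conversions, one add. */
float sum_pair(const element_t_0 *p) {
    float f0 = (float)p->m0;  /* first int32_t -> float conversion */
    float f1 = (float)p->m1;  /* second, adjacent int32_t -> float conversion */
    return f0 + f1;           /* add of the converted pair */
}
```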

**Floating-point trapping is not the issue:** Adding `-fno-trapping-math`
(which lets the compiler assume that floating-point operations cannot raise
user-visible traps) also does not enable vectorization. This rules out
concerns about `scvtf` or `fadd` potentially trapping as the blocking factor.
