https://gcc.gnu.org/bugzilla/show_bug.cgi?id=125247

            Bug ID: 125247
           Summary: SVE vectorization fails for loop with distance -8
                    dependence and mixed-type struct — legality analysis
                    bailout, not cost-model
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: bug_hunters at yeah dot net
  Target Milestone: ---

**Description:**

GCC trunk fails to vectorize a loop that carries an inter-iteration dependence
of distance 8 when the loop body converts the fields of a mixed-type struct
(short, short, long) to float, targeting AArch64 with SVE. The dump shows the
vectorizer giving up during analysis, with `Analysis failed with vector mode
VNx8HI` and `Skipping vector mode VNx16QI, which would repeat the analysis for
VNx8HI`, which points at **legality/dependence analysis** rather than the cost
model. Consistent with that, `-fvect-cost-model=unlimited` still produces no
vector code.

In contrast, Clang 22.1.4 vectorizes the same loop with NEON (width 4) when
vectorization is requested via `#pragma clang loop vectorize(enable)`. This
indicates that the loop is legal to vectorize at that width and that the
problem lies in GCC's analysis phase rather than in a cost-model rejection.
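
The legality claim can also be checked without relying on either compiler's vectorizer. The sketch below is an illustration, not part of the original report: it mirrors the loop from the test case and compares it against a blocked version that performs all `VF` loads of `out[i - 8]` before any of the `VF` stores, which is the memory ordering a vector iteration would have. The names `foo_scalar`, `foo_blocked`, `blocked_matches_scalar` and the input values are assumptions made here. The inputs are small integers so that every float operation is exact.

```c
#include <assert.h>   /* static_assert */
#include <string.h>   /* memset, memcmp */

typedef struct {
    short m0;
    short m1;
    long  m2;
} element_t_0;

#define VF 4          /* emulated vectorization factor (Clang chose 4) */
static_assert(VF <= 8, "blocking is only legal up to the dependence distance");

/* Scalar reference: same computation as the loop in the report. */
static void foo_scalar(const element_t_0 *a, float *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f)
               + ((float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2);
}

/* Blocked version: per block, all loads happen before all stores. */
static void foo_blocked(const element_t_0 *a, float *out, int n)
{
    int i = 0;
    for (; i + VF <= n; i += VF) {
        float prev[VF], sum[VF];
        for (int l = 0; l < VF; l++)          /* "vector load" of out[i-8..] */
            prev[l] = (i + l >= 8) ? out[i + l - 8] : 0.0f;
        for (int l = 0; l < VF; l++)          /* field loads and conversions */
            sum[l] = (float)a[i + l].m0 + (float)a[i + l].m1
                   + (float)a[i + l].m2;
        for (int l = 0; l < VF; l++)          /* "vector store" of out[i..] */
            out[i + l] = prev[l] + sum[l];
    }
    for (; i < n; i++)                        /* scalar tail */
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f)
               + ((float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2);
}

/* Returns 1 when both versions produce byte-identical output. */
int blocked_matches_scalar(void)
{
    enum { N = 67 };                          /* not a multiple of VF */
    element_t_0 a[N];
    float ref[N], blk[N];
    for (int i = 0; i < N; i++) {
        a[i].m0 = (short)(i % 7);
        a[i].m1 = (short)(-(i % 5));
        a[i].m2 = (long)i * 3;
    }
    memset(ref, 0, sizeof ref);
    memset(blk, 0, sizeof blk);
    foo_scalar(a, ref, N);
    foo_blocked(a, blk, N);
    return memcmp(ref, blk, sizeof ref) == 0;
}
```

The same reasoning holds for any `VF` up to 8. Beyond 8, a block would read elements of `out` that the same block has yet to write. That is relevant to SVE because the number of 32-bit lanes is not fixed at compile time and can exceed 8 on implementations with vectors wider than 256 bits.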

**Test case:**
```c
typedef struct {
    short m0;
    short m1;
    long  m2;
} element_t_0;

float foo(const element_t_0 *__restrict__ a,
          float *__restrict__ out,
          int n)
{
    for (int i = 0; i < n; i += 1) {
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f)
               + ((float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2);
    }
    return 0.0f;
}
```
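
The layout of `element_t_0` determines how its fields can be loaded into vectors. A minimal sketch that makes the layout explicit, assuming an LP64 target such as aarch64-linux-gnu (the `__LP64__` guard and the helper `padding_before_m2` are illustrative and not part of the test case above):

```c
#include <assert.h>   /* static_assert (C11) */
#include <stddef.h>   /* offsetof, size_t */

typedef struct {
    short m0;
    short m1;
    long  m2;
} element_t_0;

#if defined(__LP64__)
/* Assumption: LP64, where long is 8 bytes wide and 8-byte aligned. */
static_assert(offsetof(element_t_0, m0) == 0,  "m0 at byte 0");
static_assert(offsetof(element_t_0, m1) == 2,  "m1 at byte 2");
static_assert(offsetof(element_t_0, m2) == 8,  "m2 at byte 8");
static_assert(sizeof(element_t_0)       == 16, "16-byte element stride");
#endif

/* Number of padding bytes the compiler inserts between m1 and m2. */
size_t padding_before_m2(void)
{
    return offsetof(element_t_0, m2)
         - (offsetof(element_t_0, m1) + sizeof(short));
}
```

Under that assumption each element occupies 16 bytes, of which 4 are padding, so consecutive `m0`, `m1` and `m2` values are each 16 bytes apart in memory.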

**GCC version:**
```
aarch64-unknown-linux-gnu-gcc (GCC) 16.0.1 20260429 (experimental) [trunk]
```

**Compilation options:**
```
-O3 -march=armv9-a+sve -ftree-vectorize -fopt-info-vec-all
-fvect-cost-model=unlimited -fno-trapping-math
```

**GCC trunk output:**
```
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: unsupported SLP instances
<source>:14:23: missed: couldn't vectorize loop
<source>:14:23: missed: unsupported SLP instances
<source>:8:7: note: vectorized 0 loops in function.
<source>:18:12: note: ***** Analysis failed with vector mode VNx8HI
<source>:18:12: note: ***** Skipping vector mode VNx16QI, which would repeat the analysis for VNx8HI
```

Generated assembly (scalar only — full loop body remains unvectorized):
```assembly
foo:
        cmp     w2, 0
        ble     .L2
        ...
.L3:
        ldrsh   w6, [x0, 2]
        ldrsh   w5, [x0, -16]
        ldr     x4, [x0, -8]
        scvtf   s1, w6
        scvtf   s0, w5
        ldr     s30, [x7, x3, lsl 2]
        scvtf   s31, x4
        fadd    s0, s1, s0
        fadd    s31, s0, s31
        fadd    s30, s31, s30
        str     s30, [x1, x3, lsl 2]
        add     x3, x3, 1
        cmp     w2, w3
        bgt     .L3
.L2:
        movi    v0.2s, #0
        ret
```

Also reproducible on Godbolt: https://godbolt.org/z/nGrr6PnMd

**Clang 22.1.4 with `#pragma clang loop vectorize(enable)` (for comparison):**

Clang output:
```
<source>:12:5: remark: Max legal vector width too small, scalable vectorization unfeasible. [-Rpass-analysis]
<source>:12:5: remark: the cost-model indicates that interleaving is not beneficial [-Rpass-analysis=loop-vectorize]
<source>:12:5: remark: vectorized loop (vectorization width: 4, interleaved count: 1) [-Rpass=loop-vectorize]
<source>:15:65: remark: List vectorization was possible but not beneficial with cost 0 >= 0 [-Rpass-missed=slp-vectorizer]
```

Key vectorized portion (NEON, width 4, `#pragma` overrides cost-model):
```assembly
.LBB0_4:
        ldp     q7, q6, [x10]
        ldr     s4, [x10, #40]
        ldr     d5, [x10, #24]
        ldur    d16, [x10, #-8]
        zip1    v4.4h, v5.4h, v4.4h
        zip1    v6.2d, v7.2d, v6.2d
        ldr     d7, [x10, #8]
        zip1    v17.4h, v16.4h, v7.4h
        trn2    v7.4h, v16.4h, v7.4h
        ...
        sshll   v7.4s, v7.4h, #0
        scvtf   v7.4s, v7.4s
        sshll   v4.4s, v17.4h, #0
        scvtf   v4.4s, v4.4s
        ...
        fadd    v4.4s, v4.4s, v7.4s
        fadd    v4.4s, v4.4s, v6.4s
        ld1w    { z5.s }, p1/z, [x12, x11, lsl #2]
        fadd    v4.4s, v5.4s, v4.4s
        str     q4, [x12], #16
        b.ne    .LBB0_4
```

Also reproducible on Godbolt: https://godbolt.org/z/aMTKEsjvj

**Additional notes:**

- The loop has a read-after-write dependence of distance 8 (`out[i]` reads
`out[i-8]`, written eight iterations earlier). Both the distance and direction
are compile-time constants, so vectorization is safe for any vectorization
factor of at most 8.
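- The struct layout `{short, short, long}` contains alignment padding (4 bytes
between `m1` and `m2` under LP64, giving a 16-byte element), so the fields must
be loaded with strided or gather-like accesses and converted from integer to
float (scvtf) before they can be packed into float vectors.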
- Clang, when forced via `#pragma clang loop vectorize(enable)`, successfully
vectorizes the loop with NEON (width 4). This **confirms the loop is
semantically legal to vectorize** at that width: 4 lanes do not violate a
dependence distance of 8. The pragma overrides any cost-model hesitation, and
Clang's legality analysis passes. Clang's own remark that scalable
vectorization is unfeasible because the maximum legal vector width is too small
shows that it does not use SVE vectors for this loop either.
- By contrast, GCC's failure appears to occur **before reaching the cost
model**, as evidenced by:
  - `Analysis failed with vector mode VNx8HI`, which reports a failed analysis
rather than a cost-model rejection or an internal compiler error. The only
reason given in the output is `unsupported SLP instances`.
  - `-fvect-cost-model=unlimited` has no effect on the outcome.
  - The output shows no fallback to fixed-width NEON vectorization, although
NEON is available under `-march=armv9-a+sve`.
- The bailout at `VNx8HI`, together with the explicit skip of `VNx16QI` because
it "would repeat the analysis", suggests a structural issue in how the
vectorizer analyzes mixed-type struct fields for SVE modes, rather than a
limitation of one particular mode.
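
One way to narrow down which ingredient triggers the bailout is to compile two reduced variants with the same options. The sketch below is hypothetical: `dep_only` and `struct_only` are names introduced here, and neither variant has been run through the GCC build from this report, so no outcome is claimed for either.

```c
typedef struct {
    short m0;
    short m1;
    long  m2;
} element_t_0;

/* Variant A (hypothetical): keeps the distance-8 dependence on out[],
   but reads plain floats instead of the mixed-type struct. */
void dep_only(const float *__restrict__ a,
              float *__restrict__ out,
              int n)
{
    for (int i = 0; i < n; i += 1)
        out[i] = ((i >= 8) ? out[i - 8] : 0.0f) + a[i];
}

/* Variant B (hypothetical): keeps the struct field loads and the
   integer-to-float conversions, but has no loop-carried dependence. */
void struct_only(const element_t_0 *__restrict__ a,
                 float *__restrict__ out,
                 int n)
{
    for (int i = 0; i < n; i += 1)
        out[i] = (float)a[i].m0 + (float)a[i].m1 + (float)a[i].m2;
}
```

If only one of the two variants is rejected under `-fopt-info-vec-all`, that would indicate whether the dependence or the struct access pattern is responsible. If both vectorize, the combination of the two would be the more likely trigger.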
