https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947
          Component|target                      |tree-optimization
   Last reconfirmed|                            |2024-02-26

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that we fail to SLP vectorize this (at -O3 we unroll the inner loop):

t.c:4:20: note:   ==> examining statement: _34 = *_33;
t.c:4:20: missed:   peeling for gaps insufficient for access
t.c:5:51: missed:   not vectorized: relevant stmt not supported: _34 = *_33;
t.c:4:20: note:   removing SLP instance operations starting from: *_29 = _35;
t.c:4:20: missed:  unsupported SLP instances

which is because 'factor[i]' is treated as a vector load

t.c:4:20: note:   node 0x687f730 (max_nunits=4, refcnt=2) const vector(4) double
t.c:4:20: note:   op template: _34 = *_33;
t.c:4:20: note:         stmt 0 _34 = *_33;
t.c:4:20: note:         stmt 1 _34 = *_33;
t.c:4:20: note:         stmt 2 _34 = *_33;
t.c:4:20: note:         stmt 3 _34 = *_33;
t.c:4:20: note:         load permutation { 0 0 0 0 }

and we don't anticipate that we can do this with a load-and-splat (I'm not
even sure we'd eventually do that).
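
For illustration only, here is a hand-written sketch of what a load-and-splat
for the { 0 0 0 0 } load permutation amounts to, using the GCC generic vector
extension. The names v4df, rescale_x4_splat and rescale_x4_splat_demo are
made up for this example and do not come from the vectorizer or the bug
report; this is not a claim about what code GCC would generate.

```c
#include <assert.h>
#include <string.h>

/* GCC/Clang generic vector type: four doubles, 32 bytes. */
typedef double v4df __attribute__((vector_size(32)));

/* Hand-written equivalent of the loop: one scalar load of factor[i],
   replicated into all four lanes (load permutation { 0 0 0 0 }). */
void rescale_x4_splat(double *__restrict data,
                      const double *__restrict factor, int n)
{
    for (int i = 0; i < n; i++) {
        double f = factor[i];                 /* single scalar load */
        v4df splat = { f, f, f, f };          /* the "splat" */
        v4df v;
        memcpy(&v, data + 4 * i, sizeof v);   /* unaligned vector load */
        v *= splat;                           /* lane-wise multiply */
        memcpy(data + 4 * i, &v, sizeof v);   /* unaligned vector store */
    }
}

/* Small usage check; values are chosen so the products are exact. */
void rescale_x4_splat_demo(void)
{
    double d[8] = { 1, 2, 3, 4, 5, 6, 7, 8 };
    double f[2] = { 2.0, 0.5 };
    rescale_x4_splat(d, f, 2);
    assert(d[0] == 2.0 && d[3] == 8.0);
    assert(d[4] == 2.5 && d[7] == 4.0);
}
```

The vector is kept local to the function rather than passed or returned, so
the sketch should also build without AVX enabled.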

I think we might have a duplicate bug report for this issue.

Note with GCC 13 we refuse to SLP because

t.c:4:20: missed:   Build SLP failed: not grouped load _35 = *_34;

You can help GCC by doing

void rescale_x4(double* __restrict data, const double * __restrict factor, int n)
{
    for (int i=0; i<n; i++) {
#pragma GCC unroll 0
        for (int k=0; k<4; k++) data[4*i+k] *= factor[i];
    }
}

which will get you

rescale_x4:
.LFB0:
        .cfi_startproc
        testl   %edx, %edx
        jle     .L5
        movslq  %edx, %rdx
        salq    $5, %rdx
        leaq    (%rdi,%rdx), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        vbroadcastsd    (%rsi), %ymm0
        addq    $32, %rdi
        addq    $8, %rsi
        vmulpd  -32(%rdi), %ymm0, %ymm0
        vmovupd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
        vzeroupper
.L5:
        ret
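
To try this locally, a small harness along the following lines could confirm
that the pragma variant computes the same values as a plain scalar version.
This is only a sketch: rescale_x4_ref and rescale_x4_selfcheck are
hypothetical helpers, not part of the bug report, and the compiler options
used for the assembly above are not stated in this comment (the ymm registers
suggest AVX was enabled).

```c
#include <assert.h>

/* The variant from this comment: the inner loop is kept rolled. */
void rescale_x4(double *__restrict data, const double *__restrict factor, int n)
{
    for (int i = 0; i < n; i++) {
#pragma GCC unroll 0
        for (int k = 0; k < 4; k++)
            data[4 * i + k] *= factor[i];
    }
}

/* Hypothetical scalar reference with the inner loop written out by hand. */
static void rescale_x4_ref(double *data, const double *factor, int n)
{
    for (int i = 0; i < n; i++) {
        data[4 * i + 0] *= factor[i];
        data[4 * i + 1] *= factor[i];
        data[4 * i + 2] *= factor[i];
        data[4 * i + 3] *= factor[i];
    }
}

/* Runs both versions on the same input and asserts identical results. */
void rescale_x4_selfcheck(void)
{
    enum { N = 5 };
    double a[4 * N], b[4 * N], f[N];

    for (int i = 0; i < N; i++)
        f[i] = 0.5 * (i + 1);
    for (int j = 0; j < 4 * N; j++)
        a[j] = b[j] = (double)(j + 1);

    rescale_x4(a, f, N);
    rescale_x4_ref(b, f, N);

    for (int j = 0; j < 4 * N; j++)
        assert(a[j] == b[j]);
}
```

Both versions perform the same single multiplication per element, so an
exact comparison is used here instead of a tolerance.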


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
