https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114107
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
     Ever confirmed|0                           |1
             Blocks|                            |53947
          Component|target                      |tree-optimization
   Last reconfirmed|                            |2024-02-26

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note that we fail to SLP vectorize this (at -O3 we unroll the inner loop):

t.c:4:20: note:   ==> examining statement: _34 = *_33;
t.c:4:20: missed:   peeling for gaps insufficient for access
t.c:5:51: missed:   not vectorized: relevant stmt not supported: _34 = *_33;
t.c:4:20: note:   removing SLP instance operations starting from: *_29 = _35;
t.c:4:20: missed:  unsupported SLP instances

which is because 'factor[i]' is treated as vector load

t.c:4:20: note:   node 0x687f730 (max_nunits=4, refcnt=2) const vector(4) double
t.c:4:20: note:   op template: _34 = *_33;
t.c:4:20: note:         stmt 0 _34 = *_33;
t.c:4:20: note:         stmt 1 _34 = *_33;
t.c:4:20: note:         stmt 2 _34 = *_33;
t.c:4:20: note:         stmt 3 _34 = *_33;
t.c:4:20: note:         load permutation { 0 0 0 0 }

and we don't anticipate we can do this with a load-and-splat (I'm not sure
we'd eventually do that even).  I think we might have a duplicate bugreport
for this issue.
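
For readers unfamiliar with the term: a load-and-splat reads one scalar and
replicates it into every lane of a vector, which is what the load permutation
{ 0 0 0 0 } above describes.  The following is only a hand-written sketch of
that idea using the GCC vector extensions, assuming the loop multiplies four
consecutive doubles by one factor per iteration.  The names rescale_x4_splat
and v4df are made up for illustration, and this is not a description of what
the vectorizer does internally.

```c
#include <string.h>

/* GCC/Clang vector extension: four doubles in one 32-byte vector. */
typedef double v4df __attribute__((vector_size(32)));

/* Hypothetical hand-vectorized variant: multiply each group of four
   doubles in data[] by the single scalar factor[i]. */
void rescale_x4_splat(double *__restrict data,
                      const double *__restrict factor, int n)
{
  for (int i = 0; i < n; i++)
    {
      /* Load one scalar and splat it across all four lanes. */
      double f = factor[i];
      v4df splat = { f, f, f, f };

      /* Unaligned vector load, multiply, unaligned vector store.
         memcpy avoids assuming 32-byte alignment of data[]. */
      v4df d;
      memcpy(&d, &data[4 * i], sizeof d);
      d *= splat;
      memcpy(&data[4 * i], &d, sizeof d);
    }
}
```

On a target with AVX enabled one would expect the splat to become a single
broadcast instruction, but that depends on the compiler version and flags.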

Note with GCC 13 we refuse to SLP because

t.c:4:20: missed:   Build SLP failed: not grouped load _35 = *_34;

You can help GCC by doing

void rescale_x4(double* __restrict data,
                const double * __restrict factor, int n)
{
  for (int i=0; i<n; i++)
    {
#pragma GCC unroll 0
      for (int k=0; k<4; k++)
        data[4*i+k] *= factor[i];
    }
}

which will get you

rescale_x4:
.LFB0:
        .cfi_startproc
        testl   %edx, %edx
        jle     .L5
        movslq  %edx, %rdx
        salq    $5, %rdx
        leaq    (%rdi,%rdx), %rax
        .p2align 4,,10
        .p2align 3
.L3:
        vbroadcastsd    (%rsi), %ymm0
        addq    $32, %rdi
        addq    $8, %rsi
        vmulpd  -32(%rdi), %ymm0, %ymm0
        vmovupd %ymm0, -32(%rdi)
        cmpq    %rdi, %rax
        jne     .L3
        vzeroupper
.L5:
        ret


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations