https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92080

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |crazylht at gmail dot com
             Blocks|                            |53947
          Component|middle-end                  |rtl-optimization

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Similar when vectorizing

int a[4096];

void foo ()
{
  for (int i = 1; i < 4095; ++i)
    a[i] = 42;
}

the combination of peeling for alignment and the epilog yields on GIMPLE:

  <bb 2> [local count: 10737416]:
  MEM <vector(8) int> [(int *)&a + 4B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 36B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 52B] = { 42, 42 };
  a[15] = 42;
  ivtmp.28_59 = (unsigned long) &MEM <int[4096]> [(void *)&a + 64B];
  _1 = (unsigned long) &a;
  _182 = _1 + 16320;

  <bb 3> [local count: 75161909]:
  # ivtmp.28_71 = PHI <ivtmp.28_65(3), ivtmp.28_59(2)>
  _21 = (void *) ivtmp.28_71;
  MEM <vector(16) int> [(int *)_21] = { 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42, 42 };
  ivtmp.28_65 = ivtmp.28_71 + 64;
  if (ivtmp.28_65 != _182)
    goto <bb 3>; [85.71%]
  else
    goto <bb 4>; [14.29%]

  <bb 4> [local count: 21474835]:
  MEM <vector(8) int> [(int *)&a + 16320B] = { 42, 42, 42, 42, 42, 42, 42, 42 };
  MEM <vector(4) int> [(int *)&a + 16352B] = { 42, 42, 42, 42 };
  MEM <vector(2) int> [(int *)&a + 16368B] = { 42, 42 };
  a[4094] = 42;
  return;

and that in turn causes a lot of redundant broadcasts from constants (via
GPRs):

foo:
.LFB0:
        .cfi_startproc
        movl    $42, %eax
        movq    .LC2(%rip), %rcx
        movl    $42, %edx
        movl    $42, a+60(%rip)
        vpbroadcastd    %eax, %ymm0
        vmovdqu %ymm0, a+4(%rip)
        vpbroadcastd    %eax, %xmm0
        movl    $a+64, %eax
        vmovdqu %xmm0, a+36(%rip)
        vpbroadcastd    %edx, %zmm0
        movq    %rcx, a+52(%rip)
.L2:
        vmovdqa32       %zmm0, (%rax)
        subq    $-128, %rax
        vmovdqa32       %zmm0, -64(%rax)
        cmpq    $a+16320, %rax
        jne     .L2
        vpbroadcastd    %edx, %ymm0
        movq    %rcx, a+16368(%rip)
        movl    $42, a+16376(%rip)
        vmovdqa %ymm0, a+16320(%rip)
        vpbroadcastd    %edx, %xmm0
        vmovdqa %xmm0, a+16352(%rip)
        vzeroupper
        ret

as they are constant on GIMPLE any "CSE" we'd perform there would be undone
quickly by constant propagation.  So it's only on RTL where the actual
broadcast is a non-constant operation that we can and should optimize this
somehow.  Some kind of LCM to also handle earlier small but later bigger
broadcasts would be necessary here.


Referenced Bugs:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=53947
[Bug 53947] [meta-bug] vectorizer missed-optimizations
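
A rough C model of the point in comment #6, offered as a sketch rather than anything GCC does: every narrower { 42, ... } vector is just the low lanes of the 16-lane one, so a single broadcast could feed all the stores in the dump. The helper store_lanes and the names foo_single_broadcast, foo_scalar and same_result are hypothetical and exist only for this illustration; the byte offsets are the ones from the GIMPLE above, with a[15] and a[4094] written as 1-lane stores at 60B and 16376B.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define N 4096

/* Hypothetical helper: model a vector store of `lanes` ints at byte offset
   `off` by copying the low lanes of one 16-lane broadcast value.  */
static void store_lanes(int *a, size_t off, const int *bcast, size_t lanes)
{
    assert(lanes <= 16 && off % sizeof(int) == 0);
    assert(off + lanes * sizeof(int) <= N * sizeof(int));
    memcpy((char *)a + off, bcast, lanes * sizeof(int));
}

/* Replays the store sequence from the GIMPLE dump using one broadcast.  */
void foo_single_broadcast(int *a)
{
    int bcast[16];
    for (int i = 0; i < 16; ++i)
        bcast[i] = 42;                  /* the only "broadcast" */

    /* Prologue from peeling for alignment: elements 1..15.  */
    store_lanes(a, 4, bcast, 8);
    store_lanes(a, 36, bcast, 4);
    store_lanes(a, 52, bcast, 2);
    store_lanes(a, 60, bcast, 1);       /* a[15] = 42 */

    /* Main loop: 64-byte stores from offset 64 up to, not including, 16320.  */
    for (size_t off = 64; off != 16320; off += 64)
        store_lanes(a, off, bcast, 16);

    /* Epilogue: elements 4080..4094.  */
    store_lanes(a, 16320, bcast, 8);
    store_lanes(a, 16352, bcast, 4);
    store_lanes(a, 16368, bcast, 2);
    store_lanes(a, 16376, bcast, 1);    /* a[4094] = 42 */
}

/* The original scalar loop, for comparison.  */
void foo_scalar(int *a)
{
    for (int i = 1; i < 4095; ++i)
        a[i] = 42;
}

/* Returns 1 if both versions leave the array in the same state.  */
int same_result(void)
{
    static int x[N], y[N];
    memset(x, 0, sizeof x);
    memset(y, 0, sizeof y);
    foo_single_broadcast(x);
    foo_scalar(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

Calling same_result() compares the modelled store sequence against the scalar loop. The model says nothing about how an RTL pass would find or place the shared broadcast; it only shows that the low lanes of the widest broadcast carry the values the narrower stores need.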