https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
So performance counters on my Broadwell machine say that there are zero hits
for this vectorized loop with LSD.UOPS while there are very many for the scalar
case.  This means the loop body is too large to trigger the LSD (and probably
also to fit the uop cache).

One of the issues is that we require 4 registers for the indexes into the loads

        vmovq   (%rax,%r11,2), %xmm7
        vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
        vmovq   (%rax), %xmm7
        vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
        vmovq   (%rax,%r15), %xmm7
        vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
        vmovq   (%rax,%r11,4), %xmm7
        vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
        vinserti128     $0x1, %xmm4, %ymm9, %ymm9

from

  _1507 = (void *) ivtmp.760_1462;
  _792 = MEM[base: _1507, offset: 0B];
  _1508 = (void *) ivtmp.760_1462;
  _794 = MEM[base: _1508, index: _331, offset: 0B];
  _1509 = (void *) ivtmp.760_1462;
  _796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
  _1510 = (void *) ivtmp.760_1462;
  _1511 = _331 * 3;
  _798 = MEM[base: _1510, index: _1511, offset: 0B];
  _1512 = (void *) ivtmp.760_1462;
  _800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
  _1513 = (void *) ivtmp.760_1462;
  _1514 = _331 * 5;
  _802 = MEM[base: _1513, index: _1514, offset: 0B];
  _1515 = (void *) ivtmp.760_1462;
  _1516 = _331 * 6;
  _804 = MEM[base: _1515, index: _1516, offset: 0B];
  _1517 = (void *) ivtmp.760_1462;
  _1518 = _331 * 7;
  _806 = MEM[base: _1517, index: _1518, offset: 0B];
  vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};

where IVOPTs did a reasonable job.  Later LIM hoists all the invariant
_331 * N indexes.  And IVOPTs failed to realize that _331 * 3 can be used for
_331 * 6 by using step == 2.
But in the end the register-optimal decision is probably to strength-reduce this (the vectorizer generates strength-reduced code). We do end up spilling most IVs in this loop.
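
To make the three addressing strategies discussed above concrete, here is a rough C sketch.  It is not taken from the PR's testcase: the function names, `load_u64`, `base` and `stride` are all hypothetical, with `stride` standing in for _331 and `base` for the ivtmp pointer.  It only illustrates how many invariant values each form needs to keep live, and says nothing about what GCC emits for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: read one 64-bit element at base + byte_off. */
static uint64_t load_u64(const unsigned char *base, size_t byte_off)
{
    uint64_t v;
    assert(base != NULL);
    memcpy(&v, base + byte_off, sizeof v);
    return v;
}

/* Form 1: every lane has its own index k * stride.  Once the invariant
   products are hoisted out of the loop, 3*s, 5*s, 6*s and 7*s each want
   a register (2*s and 4*s fit the addressing-mode scale). */
static void gather_indexed(const unsigned char *base, size_t stride,
                           uint64_t out[8])
{
    for (size_t k = 0; k < 8; ++k)
        out[k] = load_u64(base, k * stride);
}

/* Form 2: reuse 3*s for lane 6 by scaling it by 2 ("step == 2"),
   so only 3*s, 5*s and 7*s need to be kept around. */
static void gather_shared(const unsigned char *base, size_t stride,
                          uint64_t out[8])
{
    size_t s3 = stride * 3, s5 = stride * 5, s7 = stride * 7;
    out[0] = load_u64(base, 0);
    out[1] = load_u64(base, stride);
    out[2] = load_u64(base, stride * 2);   /* index s,  step 2 */
    out[3] = load_u64(base, s3);
    out[4] = load_u64(base, stride * 4);   /* index s,  step 4 */
    out[5] = load_u64(base, s5);
    out[6] = load_u64(base, s3 * 2);       /* index 3s, step 2 */
    out[7] = load_u64(base, s7);
}

/* Form 3: strength-reduced.  One running pointer is bumped by stride
   between loads, so only the pointer and stride stay live. */
static void gather_strength_reduced(const unsigned char *base, size_t stride,
                                    uint64_t out[8])
{
    const unsigned char *p = base;
    out[0] = load_u64(p, 0);
    for (size_t k = 1; k < 8; ++k) {
        p += stride;
        out[k] = load_u64(p, 0);
    }
}
```

All three forms read the same eight elements.  The third trades the extra index registers for a serial dependency chain on the running pointer.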