https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
Performance counters on my Broadwell machine show zero LSD.UOPS hits for
this vectorized loop, while there are very many for the scalar case.  This
means the loop body is too large to trigger the LSD (Loop Stream Detector),
and probably also too large to fit the uop cache.

One of the issues is that we require 4 registers for the indexes into the loads:

        vmovq   (%rax,%r11,2), %xmm7
        vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
        vmovq   (%rax), %xmm7
        vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
        vmovq   (%rax,%r15), %xmm7
        vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
        vmovq   (%rax,%r11,4), %xmm7
        vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
        vinserti128     $0x1, %xmm4, %ymm9, %ymm9

from

  _1507 = (void *) ivtmp.760_1462;
  _792 = MEM[base: _1507, offset: 0B];
  _1508 = (void *) ivtmp.760_1462;
  _794 = MEM[base: _1508, index: _331, offset: 0B];
  _1509 = (void *) ivtmp.760_1462;
  _796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
  _1510 = (void *) ivtmp.760_1462;
  _1511 = _331 * 3;
  _798 = MEM[base: _1510, index: _1511, offset: 0B];
  _1512 = (void *) ivtmp.760_1462;
  _800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
  _1513 = (void *) ivtmp.760_1462;
  _1514 = _331 * 5;
  _802 = MEM[base: _1513, index: _1514, offset: 0B];
  _1515 = (void *) ivtmp.760_1462;
  _1516 = _331 * 6;
  _804 = MEM[base: _1515, index: _1516, offset: 0B];
  _1517 = (void *) ivtmp.760_1462;
  _1518 = _331 * 7;
  _806 = MEM[base: _1517, index: _1518, offset: 0B];
  vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};

where IVOPTs did a reasonable job.  Later, LIM hoists all the invariant
_331 * N indexes.  IVOPTs also failed to realize that _331 * 3 can be reused
for _331 * 6 by using step == 2.  But in the end the register-optimal
decision is probably to strength-reduce this (the vectorizer generates
strength-reduced code).
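
As a rough illustration of the two addressing strategies (a sketch only, not
taken from the testcase; the function names and the 512-byte buffer are made
up, and nothing is claimed about what code GCC emits for this C), the eight
strided loads can be written with explicit multiplied offsets, where the 6x
offset reuses the 3x value with a scale of 2, or strength-reduced to a single
running pointer:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Variant A: explicit per-row byte offsets, mirroring the MEM[] forms.
   s*3, s*5 and s*7 are separate loop-invariant values; the 6x offset
   reuses s3 with a scale of 2 instead of needing its own multiply. */
static void gather_indexed(const unsigned char *base, size_t s,
                           uint64_t out[8])
{
    const size_t s3 = s * 3, s5 = s * 5, s7 = s * 7;
    const size_t off[8] = { 0, s, s * 2, s3, s * 4, s5, s3 * 2, s7 };

    for (size_t k = 0; k < 8; k++)
        memcpy(&out[k], base + off[k], sizeof out[k]);
}

/* Variant B: strength-reduced.  A single running pointer is advanced by
   the stride after each load, so no multiplied indexes are kept live. */
static void gather_strength_reduced(const unsigned char *base, size_t s,
                                    uint64_t out[8])
{
    const unsigned char *p = base;

    for (size_t k = 0; k < 8; k++) {
        memcpy(&out[k], p, sizeof out[k]);
        p += s;
    }
}

/* Hypothetical self-check: returns 1 when both variants read the same
   eight values from a small test buffer. */
static int gathers_agree(size_t stride)
{
    unsigned char buf[512];
    uint64_t a[8], b[8];

    /* The last load touches bytes [7*stride, 7*stride + 8). */
    assert(stride * 7 + sizeof(uint64_t) <= sizeof buf);

    for (size_t i = 0; i < sizeof buf; i++)
        buf[i] = (unsigned char)(i * 31u + 7u);

    gather_indexed(buf, stride, a);
    gather_strength_reduced(buf, stride, b);
    return memcmp(a, b, sizeof a) == 0;
}
```

Variant A needs the stride plus the 3x, 5x and 7x values live across the
loads; variant B trades them for one pointer that is updated between loads.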

We do end up spilling most IVs in this loop.
