[Bug tree-optimization/84037] [8 Regression] Speed regression of polyhedron benchmark since r256644

rguenth at gcc dot gnu.org Mon, 29 Jan 2018 07:13:38 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84037


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org

--- Comment #13 from Richard Biener <rguenth at gcc dot gnu.org> ---
So performance counters on my Broadwell machine say that there are zero hits
for this vectorized loop with LSD.UOPS while there are very many for the scalar
case.  This means the loop body is too large to trigger the LSD (and probably
also to fit the uop cache).

One of the issues is that we require 4 registers for the indexes into the loads:

        vmovq   (%rax,%r11,2), %xmm7
        vpinsrq $1, (%rax,%r13), %xmm7, %xmm4
        vmovq   (%rax), %xmm7
        vpinsrq $1, (%rax,%r11), %xmm7, %xmm9
        vmovq   (%rax,%r15), %xmm7
        vpinsrq $1, (%rax,%r12), %xmm7, %xmm3
        vmovq   (%rax,%r11,4), %xmm7
        vpinsrq $1, (%rax,%r14), %xmm7, %xmm1
        vinserti128     $0x1, %xmm4, %ymm9, %ymm9

from

  _1507 = (void *) ivtmp.760_1462;
  _792 = MEM[base: _1507, offset: 0B];
  _1508 = (void *) ivtmp.760_1462;
  _794 = MEM[base: _1508, index: _331, offset: 0B];
  _1509 = (void *) ivtmp.760_1462;
  _796 = MEM[base: _1509, index: _331, step: 2, offset: 0B];
  _1510 = (void *) ivtmp.760_1462;
  _1511 = _331 * 3;
  _798 = MEM[base: _1510, index: _1511, offset: 0B];
  _1512 = (void *) ivtmp.760_1462;
  _800 = MEM[base: _1512, index: _331, step: 4, offset: 0B];
  _1513 = (void *) ivtmp.760_1462;
  _1514 = _331 * 5;
  _802 = MEM[base: _1513, index: _1514, offset: 0B];
  _1515 = (void *) ivtmp.760_1462;
  _1516 = _331 * 6;
  _804 = MEM[base: _1515, index: _1516, offset: 0B];
  _1517 = (void *) ivtmp.760_1462;
  _1518 = _331 * 7;
  _806 = MEM[base: _1517, index: _1518, offset: 0B];
  vect_cst__808 = {_792, _794, _796, _798, _800, _802, _804, _806};

where IVOPTs did a reasonable job.  Later LIM hoists all the invariant
_331 * N indexes.  And IVOPTs failed to realize that _331 * 3 can be used
for _331 * 6 by using step == 2.  But in the end the register-optimal
decision is probably to strength-reduce this (the vectorizer generates
strength-reduced code).

We do end up spilling most IVs in this loop.
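
To illustrate the three addressing strategies discussed above, here is a
rough C sketch (not GCC code; all function names are hypothetical, and
8-byte elements are assumed from the vmovq/vpinsrq loads).  The first
variant keeps four hoisted products live, the second reuses stride * 3
with a scale of 2 for stride * 6, and the third strength-reduces to a
single running pointer.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: unaligned-safe 8-byte load. */
static uint64_t load64(const char *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Indexed form, as after LIM: s*3, s*5, s*6 and s*7 are loop invariant
   and each occupies a register; s*2 and s*4 fit the x86 scale field. */
static void gather8_indexed(const char *base, ptrdiff_t s, uint64_t out[8])
{
    ptrdiff_t s3 = s * 3, s5 = s * 5, s6 = s * 6, s7 = s * 7;
    out[0] = load64(base);
    out[1] = load64(base + s);
    out[2] = load64(base + s * 2);
    out[3] = load64(base + s3);
    out[4] = load64(base + s * 4);
    out[5] = load64(base + s5);
    out[6] = load64(base + s6);
    out[7] = load64(base + s7);
}

/* Reuse form: s*6 is addressed as (s*3) with step 2, saving one product. */
static void gather8_reuse(const char *base, ptrdiff_t s, uint64_t out[8])
{
    ptrdiff_t s3 = s * 3, s5 = s * 5, s7 = s * 7;
    out[0] = load64(base);
    out[1] = load64(base + s);
    out[2] = load64(base + s * 2);
    out[3] = load64(base + s3);
    out[4] = load64(base + s * 4);
    out[5] = load64(base + s5);
    out[6] = load64(base + s3 * 2);
    out[7] = load64(base + s7);
}

/* Strength-reduced form: one pointer bumped by s, no products kept live,
   at the cost of a serial dependency between the address updates. */
static void gather8_strength_reduced(const char *base, ptrdiff_t s,
                                     uint64_t out[8])
{
    const char *p = base;
    out[0] = load64(p);
    for (int k = 1; k < 8; k++) {
        p += s;
        out[k] = load64(p);
    }
}
```

All three variants load the same eight elements; they differ only in how
many invariant values must stay in registers across the loop body.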
