https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99634

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-18
            Summary|s2102 benchmarks of TSVC is |s2102 benchmarks of TSVC is
                   |vectorized better by icc    |vectorized better by icc
                   |than gcc                    |than gcc, interchange is
                   |                            |missing
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
We cannot analyze the dependence between aa[j][i] and aa[i][i] during
outer loop vectorization.
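
For reference, a minimal sketch of the loop nest in question. This is a
paraphrase of the TSVC s2102 kernel, not a verbatim copy: LEN_2D = 256 and
the float element type are assumptions (consistent with the 1024-byte
strides in the assembly below), and is_identity() and self_check() are
hypothetical helpers added only for illustration.

```c
#include <assert.h>

#define LEN_2D 256  /* assumed size: 256 floats per row = 1024 bytes */

static float aa[LEN_2D][LEN_2D];

/* Sketch of the kernel: zero column i, then set the diagonal element. */
void s2102_kernel(void)
{
    for (int i = 0; i < LEN_2D; i++) {
        for (int j = 0; j < LEN_2D; j++)
            aa[j][i] = 0.0f;  /* column walk: each store is one row apart */
        aa[i][i] = 1.0f;      /* same location as the store above at j == i */
    }
}

/* Hypothetical helper: returns 1 if aa holds the identity matrix. */
int is_identity(void)
{
    for (int r = 0; r < LEN_2D; r++)
        for (int c = 0; c < LEN_2D; c++)
            if (aa[r][c] != (r == c ? 1.0f : 0.0f))
                return 0;
    return 1;
}

/* Hypothetical helper: run the kernel and check the result. */
void self_check(void)
{
    s2102_kernel();
    assert(is_identity());
}
```

The store to aa[i][i] writes the same element that the inner loop wrote
at j == i, which is the output dependence referred to above.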

ICC seems to completely unroll the inner loop, doing scalar stores for
everything.  The only thing "vectorized" is the constant.  Not sure
why it uses vextractps at all though - probably an artifact of the code
indeed being produced by its vectorizer.

So what ICC does is quite stupid and not vectorized.  I guess simply
unrolling would end up being faster than ICC.  Unfortunately we're
doing

.L3:
        movl    $0x00000000, (%rax)
        movl    $0x00000000, 1024(%rax)
        movl    $0x00000000, 2048(%rax)
        addq    $8192, %rax
        movl    $0x00000000, -5120(%rax)
        movl    $0x00000000, -4096(%rax)
        movl    $0x00000000, -3072(%rax)
        movl    $0x00000000, -2048(%rax)
        movl    $0x00000000, -1024(%rax)
        cmpq    %rdx, %rax
        jne     .L3

rather than using a register source operand.  The stores are also not
to consecutive addresses (each is one 1024-byte row apart), so we end
up not streaming.

So the interesting transform is not vectorization but instead
doing interchange again.  Not sure whether we're confused by the
dependence (likely), in which case we'd need loop distribution again.
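
A rough sketch of what distribution followed by interchange would look
like at the source level. This is a hand-written illustration, not what
any GCC pass currently emits; LEN_2D = 256, the float element type and
all function names here are assumptions made for the example.

```c
#include <assert.h>
#include <string.h>

#define LEN_2D 256  /* assumed size, matching the 1024-byte row stride */

static float ref[LEN_2D][LEN_2D];
static float opt[LEN_2D][LEN_2D];

/* Original shape: column-wise zeroing with the diagonal store inside. */
void original(float a[LEN_2D][LEN_2D])
{
    for (int i = 0; i < LEN_2D; i++) {
        for (int j = 0; j < LEN_2D; j++)
            a[j][i] = 0.0f;
        a[i][i] = 1.0f;
    }
}

/* After distribution the diagonal store gets its own loop; the zeroing
   nest can then be interchanged so the inner loop walks one row. */
void distributed_interchanged(float a[LEN_2D][LEN_2D])
{
    for (int j = 0; j < LEN_2D; j++)
        for (int i = 0; i < LEN_2D; i++)
            a[j][i] = 0.0f;   /* consecutive addresses in the inner loop */
    for (int i = 0; i < LEN_2D; i++)
        a[i][i] = 1.0f;       /* still runs after the zeroing of (i, i) */
}

/* Returns 1 if both versions produce identical arrays. */
int transforms_agree(void)
{
    memset(ref, 0xff, sizeof ref);  /* poison so missed stores show up */
    memset(opt, 0xff, sizeof opt);
    original(ref);
    distributed_interchanged(opt);
    return memcmp(ref, opt, sizeof ref) == 0;
}

/* Hypothetical helper: assert that the two versions agree. */
void self_check(void)
{
    assert(transforms_agree());
}
```

The split is legal because every element (i, i) is still zeroed before
it is set to 1, and the zeroing nest on its own carries no dependence
that would block interchange.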
