https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99634
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |amker at gcc dot gnu.org
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2021-03-18
            Summary|s2102 benchmarks of TSVC is |s2102 benchmarks of TSVC is
                   |vectorized better by icc    |vectorized better by icc
                   |than gcc                    |than gcc, interchange is
                   |                            |missing
             Status|UNCONFIRMED                 |NEW

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
we cannot analyze the dependence between aa[j][i] and aa[i][i] in outer loop
vectorization. ICC seems to completely unroll the inner loop, doing scalar
stores for everything. The only thing "vectorized" is the constant. Not sure
why it uses vextractps at all though - probably an artifact of indeed being
produced by its vectorization.

So what ICC does is quite stupid and not vectorized. I guess simply unrolling
would end up being faster than ICC. Unfortunately we're doing

.L3:
        movl    $0x00000000, (%rax)
        movl    $0x00000000, 1024(%rax)
        movl    $0x00000000, 2048(%rax)
        addq    $8192, %rax
        movl    $0x00000000, -5120(%rax)
        movl    $0x00000000, -4096(%rax)
        movl    $0x00000000, -3072(%rax)
        movl    $0x00000000, -2048(%rax)
        movl    $0x00000000, -1024(%rax)
        cmpq    %rdx, %rax
        jne     .L3

rather than using a register source operand. We also end up not streaming
to consecutive stores.

So the interesting transform is not vectorization but instead doing interchange
again. Not sure if we're confused by the dependence (likely) and thus we'd
need loop distribution again.
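
For illustration only, the distribution-plus-interchange idea from the comment can be sketched in C. This is a hand-written sketch, not what GCC emits or how its passes are implemented. The kernel body is reconstructed from the aa[j][i] and aa[i][i] accesses named in the comment; LEN_2D = 256 and real_t = float are assumptions inferred from the 1024-byte store stride in the assembly, and all function names are hypothetical.

```c
#include <assert.h>
#include <string.h>

#define LEN_2D 256          /* assumption: 256 floats = 1024-byte row stride */
typedef float real_t;       /* assumption: 4-byte element type */

static real_t aa_ref[LEN_2D][LEN_2D];
static real_t aa_opt[LEN_2D][LEN_2D];

/* Assumed shape of the s2102 kernel: zero column i with a strided
   inner loop, then set the diagonal element of that column. */
void s2102_original(real_t aa[LEN_2D][LEN_2D])
{
    for (int i = 0; i < LEN_2D; i++) {
        for (int j = 0; j < LEN_2D; j++)
            aa[j][i] = (real_t)0.;
        aa[i][i] = (real_t)1.;
    }
}

/* Hand-applied loop distribution followed by interchange.
   Distribution moves the diagonal store into its own loop, which is
   legal here because only iteration i ever touches column i.
   Interchange then makes the zeroing nest walk each row contiguously. */
void s2102_distributed_interchanged(real_t aa[LEN_2D][LEN_2D])
{
    for (int j = 0; j < LEN_2D; j++)
        for (int i = 0; i < LEN_2D; i++)
            aa[j][i] = (real_t)0.;      /* consecutive stores */

    for (int i = 0; i < LEN_2D; i++)
        aa[i][i] = (real_t)1.;
}

/* Returns 1 when both variants leave identical array contents. */
int s2102_variants_agree(void)
{
    /* Poison both arrays so any element left unwritten would show up. */
    memset(aa_ref, 0xff, sizeof aa_ref);
    memset(aa_opt, 0xff, sizeof aa_opt);

    s2102_original(aa_ref);
    s2102_distributed_interchanged(aa_opt);

    assert(aa_opt[0][0] == (real_t)1.);
    return memcmp(aa_ref, aa_opt, sizeof aa_ref) == 0;
}
```

Calling s2102_variants_agree() from a small driver checks that the two forms produce the same matrix. After the rewrite the zeroing nest stores to consecutive addresses, which is the access pattern the comment says the current code is missing.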