[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling

elrodc at gmail dot com Mon, 23 Jul 2018 08:01:28 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86625


--- Comment #6 from Chris Elrod <elrodc at gmail dot com> ---
(In reply to Richard Biener from comment #3)
> If you see spilling on the manually unrolled loop register pressure is
> somehow an issue.

In the matmul kernel:
D = A * X
where D is 16x14, A is 16xN, and X is Nx14 (N arbitrarily set to 32)

The code holds all of D in registers.
D is 16x14 = 224 doubles; at 8 doubles per register, that takes 28 of the 32
registers.

Then, it loads 1 column of A at a time (2 more registers), and broadcasts
elements from the corresponding row in each column of X, updating the
corresponding column of D with fma instructions.

By broadcasting 2 at a time, it should be using exactly 32 registers
(28 for D + 2 for the column of A + 2 for the broadcasts).
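
A rough sketch of the update pattern described above, in plain C rather than
the actual kernel source. The function name, the column-major layout of all
three matrices, and the local accumulator array are assumptions made for
illustration; the real kernel keeps the accumulators in vector registers.

```c
#include <assert.h>
#include <stddef.h>

enum { M = 16, P = 14 };  /* D is M x P; these names are illustrative */

/* D = A * X, all column-major (assumed): A is M x n, X is n x P, D is M x P. */
void matmul_kernel(double *D, const double *A, const double *X, size_t n)
{
    assert(D && A && X);
    double acc[P][M] = {{0.0}};   /* stands in for the 28 accumulator registers */

    for (size_t k = 0; k < n; ++k) {
        const double *a_col = A + k * M;    /* column k of A: 16 doubles */
        for (size_t j = 0; j < P; ++j) {
            double x = X[k + j * n];        /* row k of column j of X, broadcast */
            for (size_t i = 0; i < M; ++i)
                acc[j][i] += a_col[i] * x;  /* one fma per accumulator lane */
        }
    }

    /* Write the accumulators back to D once all columns of A are consumed. */
    for (size_t j = 0; j < P; ++j)
        for (size_t i = 0; i < M; ++i)
            D[i + j * M] = acc[j][i];
}
```

Every iteration of the outer loop does the same work, which is why a spill
that appears for only one column of A is surprising.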

For the most part, that is precisely what the manually unrolled code does for
each column of A.
However, for exactly one of the 32 columns of A (column 23 with -O3, since
2944/128 = 23, and column 25 with -O2), it suddenly spills. All the stack
accesses happen for that one column and none of the others, even though the
process is identical for each column.
Switching to a smaller 16x13 output, which frees up 2 registers and allows 4
broadcast loads at a time, still resulted in 4 spills (down from 5), again
only for column #23 or #25.
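
The 2944/128 = 23 figure can be reproduced as below. This assumes 2944 is a
byte offset from the start of A and that each column of A is 16 contiguous
doubles (128 bytes); the helper name is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Map a byte offset into a column-major matrix of doubles to a column index.
   Assumes the offset is measured from the start of the matrix. */
size_t column_from_offset(size_t byte_offset, size_t rows)
{
    assert(rows > 0);
    size_t column_bytes = rows * sizeof(double);  /* 16 * 8 = 128 here */
    return byte_offset / column_bytes;
}
```

With 8-byte doubles, column_from_offset(2944, 16) gives 23.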

I couldn't reproduce the spills in the avx2 kernel.
That smaller kernel has an 8x6 output, taking up 12 of the 16 registers. That
again leaves 4 registers: 2 for a column of A and 2 for broadcasts from X at a
time. So it's the same pattern.
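
The register budgets quoted in this comment can be checked with a small
helper. This is only a sketch of the arithmetic; the function and parameter
names are hypothetical, and `lanes` is the number of doubles per vector
register (8 for the larger kernel, 4 for the avx2 one).

```c
#include <assert.h>

/* Vector registers needed by the kernel pattern: accumulators for a
   rows x cols block of doubles, one column of A, and `broadcasts`
   values of X held at the same time. */
int vector_regs_needed(int rows, int cols, int lanes, int broadcasts)
{
    assert(lanes > 0 && rows % lanes == 0);
    int regs_per_column = rows / lanes;       /* registers per column of D or A */
    return regs_per_column * cols             /* accumulators for D */
         + regs_per_column                    /* current column of A */
         + broadcasts;                        /* broadcast values from X */
}
```

vector_regs_needed(16, 14, 8, 2) and vector_regs_needed(16, 13, 8, 4) both
give 32, and vector_regs_needed(8, 6, 4, 2) gives 16, so each configuration
fills its register file exactly.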


The smaller kernel does reproduce the problems with the loops: -O3 without
`-fdisable-tree-cunrolli` leads to a slow vectorization scheme, while -O3 with
it, or `-O2 -ftree-vectorize`, produces repetitive loads and stores within the
loop.

[Bug tree-optimization/86625] funroll-loops doesn't unroll, producing >3x assembly and running 10x slower than manual complete unrolling