https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88767

--- Comment #13 from Jiu Fu Guo <guojiufu at gcc dot gnu.org> ---
Hi Richard,

Looking at the changed code in comment 9, there seems to be another
opportunity to improve performance: improving the locality of accesses to
array A.

Unroll-and-jam of loop1 into loop4 (or of loop1 into loop3, after loop2/loop4
are completely unrolled) would reduce memory accesses by reusing elements of
array A.
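
To illustrate the idea, here is a minimal sketch. It is not the code from
comment 9: the function names, the sizes N/M/K and the index layout are all
assumptions made for the example. The only property it shares with the case
above is that A is not indexed by the outer loop variable, so jamming two
outer iterations into one inner body lets a single load of A feed both.

```c
#include <assert.h>
#include <string.h>

enum { N = 4, M = 10, K = 6 };  /* hypothetical sizes, chosen for the sketch */

/* Baseline: every iteration of the outer n loop re-reads all of A. */
static void kernel_baseline(const double *A, const double *B, double *C)
{
    for (int n = 0; n < N; n++)            /* outer loop (plays the role of loop1) */
        for (int k = 0; k < K; k++)
            for (int m = 0; m < M; m++)    /* inner loop */
                C[n * M + m] += A[k * M + m] * B[n * K + k];
}

/* Outer loop unrolled by 2, with both copies jammed into one inner body:
   each element of A is loaded once and used for two rows of C. */
static void kernel_unroll_jam(const double *A, const double *B, double *C)
{
    assert(N % 2 == 0);                    /* no remainder loop in this sketch */
    for (int n = 0; n < N; n += 2)
        for (int k = 0; k < K; k++)
            for (int m = 0; m < M; m++) {
                double a = A[k * M + m];   /* shared by both unrolled copies */
                C[n * M + m]       += a * B[n * K + k];
                C[(n + 1) * M + m] += a * B[(n + 1) * K + k];
            }
}

/* Returns 1 if both variants produce bit-identical results. */
static int variants_agree(void)
{
    double A[K * M], B[N * K], C1[N * M], C2[N * M];
    for (int i = 0; i < K * M; i++) A[i] = i % 7;
    for (int i = 0; i < N * K; i++) B[i] = i % 5;
    memset(C1, 0, sizeof C1);
    memset(C2, 0, sizeof C2);
    kernel_baseline(A, B, C1);
    kernel_unroll_jam(A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```

In the jammed version the number of loads from A is halved for an unroll
factor of 2, while the accumulation order for each element of C is unchanged.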

This does not seem hard to do at the source level (as with the example code
shown in comment 9), but I am still thinking about how to implement it in
GCC.

I have some concerns.  These loops are not a perfect nest: there are
stmts/instructions that belong to the outer loop (loop1) but sit outside the
inner loop (loop4).
Even if loop2 is deleted (or distributed out) and loop4 is unrolled, the store
to array C, `C[(l_n*10)+l_m] += xx`, is moved out of the inner loop (loop3) but
stays inside the outer loop (loop1).  This does not favor unroll-and-jam.
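
A small sketch of the shape I mean, again with hypothetical names and sizes
(ROWS/COLS/DEPTH, nest_imperfect, nest_jammed) rather than the code from
comment 9.  The accumulator setup and the stores to C belong to loop1 only, so
a hand-written unroll-and-jam has to duplicate them around the fused inner
loop.  Whether and how the GCC pass can do the same is exactly the open
question; the sketch only shows that the source-level result is equivalent.

```c
#include <assert.h>
#include <string.h>

enum { ROWS = 4, COLS = 2, DEPTH = 3 };  /* hypothetical sizes */

/* Imperfect nest: the accumulators and the stores to C belong to loop1
   but sit outside loop3 (the m loop is assumed already fully unrolled). */
static void nest_imperfect(const double *A, const double *B, double *C)
{
    for (int n = 0; n < ROWS; n++) {               /* loop1 */
        double c0 = 0.0, c1 = 0.0;                 /* loop1 only */
        for (int k = 0; k < DEPTH; k++) {          /* loop3 */
            c0 += A[k * COLS + 0] * B[n * DEPTH + k];
            c1 += A[k * COLS + 1] * B[n * DEPTH + k];
        }
        C[n * COLS + 0] += c0;                     /* loop1 only */
        C[n * COLS + 1] += c1;
    }
}

/* Hand unroll-and-jam by 2: the loop1-only statements are duplicated
   before and after the fused loop3, and A is loaded once per k. */
static void nest_jammed(const double *A, const double *B, double *C)
{
    assert(ROWS % 2 == 0);                         /* no remainder loop here */
    for (int n = 0; n < ROWS; n += 2) {
        double c00 = 0.0, c01 = 0.0, c10 = 0.0, c11 = 0.0;
        for (int k = 0; k < DEPTH; k++) {
            double a0 = A[k * COLS + 0], a1 = A[k * COLS + 1];
            c00 += a0 * B[n * DEPTH + k];
            c01 += a1 * B[n * DEPTH + k];
            c10 += a0 * B[(n + 1) * DEPTH + k];
            c11 += a1 * B[(n + 1) * DEPTH + k];
        }
        C[n * COLS + 0]       += c00;
        C[n * COLS + 1]       += c01;
        C[(n + 1) * COLS + 0] += c10;
        C[(n + 1) * COLS + 1] += c11;
    }
}

/* Returns 1 if both variants produce bit-identical results. */
static int nests_agree(void)
{
    double A[DEPTH * COLS], B[ROWS * DEPTH];
    double C1[ROWS * COLS], C2[ROWS * COLS];
    for (int i = 0; i < DEPTH * COLS; i++) A[i] = i + 1;
    for (int i = 0; i < ROWS * DEPTH; i++) B[i] = i % 4;
    memset(C1, 0, sizeof C1);
    memset(C2, 0, sizeof C2);
    nest_imperfect(A, B, C1);
    nest_jammed(A, B, C2);
    return memcmp(C1, C2, sizeof C1) == 0;
}
```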

Thanks for any comments!

BR. 
Jiufu Guo
