“tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];”

When I compared the assembly of the loop, these extra instructions are in
the innermost loop:

  400968: 2800      cmp     r0, #0
  400972: bf08      it      eq
  400974: 2300      moveq   r3, #0

r3 being Res.  It is the statement you have above.

“a CPU uarch with caches and HW prefetching where linear accesses are a lot
more efficient than strided ones - that might not hold at all for the
Cortex-M7.”

Yes, that’s it.  Thanks a lot for your help!
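A minimal sketch of how those instructions line up with the interchanged
innermost body quoted further down, assuming r0 holds Index and r3 carries
the Res value (the register mapping is inferred from the note above):

  tem = Index == 0 ? 0                        /* cmp r0, #0; it eq;  */
        : (*(matrix *)Res)[Outer][Inner];     /*   moveq r3, #0      */
  tem += A[Outer][Index] * B[Index][Inner];
  (*(matrix *)Res)[Outer][Inner] = tem;

So every pass through the innermost loop pays for the compare and the IT
block on top of the multiply-accumulate.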
From: Richard Biener <richard.guent...@gmail.com>
Date: Friday, February 14, 2025 at 2:26 AM
To: Visda Vokhshoori - C51841 <visda.vokhsho...@microchip.com>
Cc: gcc@gcc.gnu.org <gcc@gcc.gnu.org>
Subject: Re: 22% degradation seen in embench:matmult-int

On Thu, Feb 13, 2025 at 9:30 PM <visda.vokhsho...@microchip.com> wrote:
>
> “the interchanged loop might for example no longer vectorize.”
>
> The loops are not vectorized.  Which is OK, because this device doesn’t
> have support for it.
>
> I just don’t think a pass could single-handedly make code that much
> slower.
>
> Loop interchange is supposed to interchange the inner loop index with
> the outer index to improve cache locality.  This is supposed to help -
> that is, in the next iteration the data will already be available in
> the cache.
>
> The benchmark source - the loop that gets interchanged is line 143.
>
> Source:
> https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

Looks like the classical matmul loop, similar to the one in SPEC CPU
bwaves.  We do apply interchange here and that looks reasonable to me.
Note interchange assumes a CPU uarch with caches and HW prefetching where
linear accesses are a lot more efficient than strided ones - that might
not hold at all for the Cortex-M7.

Without interchange the store to Res[] can be moved out of the inner loop.

I've tried

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

void
Multiply (matrix A, matrix B, long * __restrict Res)
{
  register int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        (*(matrix *)Res)[Outer][Inner] = 0;
        for (Index = 0; Index < UPPERLIMIT; Index++)
          (*(matrix *)Res)[Outer][Inner]
            += A[Outer][Index] * B[Index][Inner];
      }
}

and this is interchanged on x86_64 as well.  We are implementing a trick
for the zeroing which, when the zeroing is moved into the innermost
position, is done as

  for (Index = 0; Index < UPPERLIMIT; Index++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];
        tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }

This conditional might kill performance for you.  The advantage is that
this loop can now be more efficiently vectorized.
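For comparison, a minimal sketch of that non-interchanged form with the
store to Res[] moved out of the inner loop, reusing the UPPERLIMIT and
matrix definitions above (the name Multiply_hoisted is only illustrative):

void
Multiply_hoisted (matrix A, matrix B, long * __restrict Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        /* Keep the running sum in a register; Res[] sees exactly one
           store per element instead of one per Index iteration.  */
        long tem = 0;

        for (Index = 0; Index < UPPERLIMIT; Index++)
          tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }
}

Here each Res[] element is written once, whereas the interchanged version
above loads, updates and stores it on every Index iteration.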
> This loop is where most of the time is spent.  But it would have been
> good if I had access to h/w tracing, to see whether the interchanged
> loop reduces cache misses as well as to see what is causing it to run
> this much slower.
>
> Thanks for your reply!
>
> From: Richard Biener <richard.guent...@gmail.com>
> Date: Thursday, February 13, 2025 at 2:57 AM
> To: Visda Vokhshoori - C51841 <visda.vokhsho...@microchip.com>
> Cc: gcc@gcc.gnu.org <gcc@gcc.gnu.org>
> Subject: Re: 22% degradation seen in embench:matmult-int
>
> On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > Embench is used for benchmarking on embedded devices.
> > This one project, matmult-int, has a function Multiply.  It is a
> > matrix multiplication for a 20 x 20 matrix.
> > The device is an ATSAME70Q21B, which is a Cortex-M7.
> > The compiler is the Arm branch based on GCC version 13.
> > We are compiling with -O3, which has the loop-interchange pass on by
> > default.
> >
> > When we compile with -fno-loop-interchange we get all 22% back, plus
> > a 5% speed-up.
> >
> > When we do the loop interchange on the one loop nest that gets
> > interchanged, it is slightly (0.7%) faster.
> >
> > Has anyone else seen a large degradation as a result of loop
> > interchange?
>
> I would suggest comparing the -fopt-info diagnostic output with and
> without -fno-loop-interchange; the interchanged loop might for example
> no longer vectorize.  Other than that - no, loop interchange isn't
> applied very often and it has a very conservative cost model.
>
> Are you able to share a testcase?
>
> Richard.
>
> > Thanks
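For reference, a minimal sketch of one way to interchange this nest at the
source level without the conditional-zeroing trick, by clearing each result
row up front; the types are the ones from the example above and the name
Multiply_interchanged is only illustrative:

void
Multiply_interchanged (matrix A, matrix B, long * __restrict Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    {
      /* Zero the whole result row first ...  */
      for (Inner = 0; Inner < UPPERLIMIT; Inner++)
        (*(matrix *)Res)[Outer][Inner] = 0;

      /* ... then accumulate with the Inner loop innermost, so B and
         Res are walked linearly.  */
      for (Index = 0; Index < UPPERLIMIT; Index++)
        for (Inner = 0; Inner < UPPERLIMIT; Inner++)
          (*(matrix *)Res)[Outer][Inner]
            += A[Outer][Index] * B[Index][Inner];
    }
}

With the Inner loop innermost, B[][] and Res[][] are accessed with linear
strides, which is the access pattern the interchange pass is trying to
create, but without the per-iteration conditional in the loop body.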