“tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];”

When I compared the assembly of the loop, these extra instructions are in
the innermost loop:

  400968: 2800      cmp     r0, #0
  400972: bf08      it      eq
  400974: 2300      moveq   r3, #0

r3 being Res.  It is the statement you have above.

“a CPU uarch with caches and HW prefetching where linear accesses are a lot
more efficient than strided ones - that might not hold at all for the
Cortex-M7.”

Yes, that’s it.  Thanks a lot for your help!
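A minimal sketch of how those instructions line up with the interchanged
innermost body quoted further down, assuming r0 holds Index and r3 carries
the Res value (the register mapping is inferred from the note above):

  tem = Index == 0 ? 0                        /* cmp r0, #0; it eq;  */
        : (*(matrix *)Res)[Outer][Inner];     /*   moveq r3, #0      */
  tem += A[Outer][Index] * B[Index][Inner];
  (*(matrix *)Res)[Outer][Inner] = tem;

So every pass through the innermost loop pays for the compare and the IT
block on top of the multiply-accumulate.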
From: Richard Biener <richard.guent...@gmail.com>
Date: Friday, February 14, 2025 at 2:26 AM
To: Visda Vokhshoori - C51841 <visda.vokhsho...@microchip.com>
Cc: gcc@gcc.gnu.org <gcc@gcc.gnu.org>
Subject: Re: 22% degradation seen in embench:matmult-int

On Thu, Feb 13, 2025 at 9:30 PM <visda.vokhsho...@microchip.com> wrote:
>
> “the interchanged loop might for example no longer vectorize.”
>
> The loops are not vectorized.  Which is OK, because this device doesn’t
> have support for it.
>
> I just don’t think a pass could single-handedly make code that much
> slower.
>
> Loop interchange is supposed to interchange the inner loop index with
> the outer index to improve cache locality.  This is supposed to help -
> that is, in the next iteration the data will already be available in
> the cache.
>
> The benchmark source - the loop that gets interchanged is line 143.
>
> Source:
> https://github.com/embench/embench-iot/blob/master/src/matmult-int/matmult-int.c#L143

Looks like the classical matmul loop, similar to the one in SPEC CPU
bwaves.  We do apply interchange here and that looks reasonable to me.
Note interchange assumes a CPU uarch with caches and HW prefetching where
linear accesses are a lot more efficient than strided ones - that might
not hold at all for the Cortex-M7.

Without interchange the store to Res[] can be moved out of the inner loop.

I've tried

#define UPPERLIMIT 20
typedef long matrix[UPPERLIMIT][UPPERLIMIT];

void
Multiply (matrix A, matrix B, long * __restrict Res)
{
  register int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        (*(matrix *)Res)[Outer][Inner] = 0;
        for (Index = 0; Index < UPPERLIMIT; Index++)
          (*(matrix *)Res)[Outer][Inner]
            += A[Outer][Index] * B[Index][Inner];
      }
}

and this is interchanged on x86_64 as well.  We are implementing a trick
for the zeroing which, when the zeroing is moved into the innermost
position, is done as

  for (Index = 0; Index < UPPERLIMIT; Index++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        tem = Index == 0 ? 0 : (*(matrix *)Res)[Outer][Inner];
        tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }

This conditional might kill performance for you.  The advantage is that
this loop can now be more efficiently vectorized.
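For comparison, a minimal sketch of that non-interchanged form with the
store to Res[] moved out of the inner loop, reusing the UPPERLIMIT and
matrix definitions above (the name Multiply_hoisted is only illustrative):

void
Multiply_hoisted (matrix A, matrix B, long * __restrict Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    for (Inner = 0; Inner < UPPERLIMIT; Inner++)
      {
        /* Keep the running sum in a register; Res[] sees exactly one
           store per element instead of one per Index iteration.  */
        long tem = 0;

        for (Index = 0; Index < UPPERLIMIT; Index++)
          tem += A[Outer][Index] * B[Index][Inner];
        (*(matrix *)Res)[Outer][Inner] = tem;
      }
}

Here each Res[] element is written once, whereas the interchanged version
above loads, updates and stores it on every Index iteration.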
> This loop is where most of the time is spent.  But it would have been
> good if I had access to h/w tracing, to see whether the interchanged
> loop reduces cache misses as well as to see what is causing it to run
> this much slower.
>
> Thanks for your reply!
>
> From: Richard Biener <richard.guent...@gmail.com>
> Date: Thursday, February 13, 2025 at 2:57 AM
> To: Visda Vokhshoori - C51841 <visda.vokhsho...@microchip.com>
> Cc: gcc@gcc.gnu.org <gcc@gcc.gnu.org>
> Subject: Re: 22% degradation seen in embench:matmult-int
>
> On Wed, Feb 12, 2025 at 4:38 PM Visda.Vokhshoori--- via Gcc
> <gcc@gcc.gnu.org> wrote:
> >
> > Embench is used for benchmarking on embedded devices.
> > This one project, matmult-int, has a function Multiply.  It is a
> > matrix multiplication for a 20 x 20 matrix.
> > The device is an ATSAME70Q21B, which is a Cortex-M7.
> > The compiler is the Arm branch based on GCC version 13.
> > We are compiling with -O3, which has the loop-interchange pass on by
> > default.
> >
> > When we compile with -fno-loop-interchange we get all 22% back, plus
> > a 5% speed-up.
> >
> > When we do the loop interchange on the one loop nest that gets
> > interchanged, it is slightly (0.7%) faster.
> >
> > Has anyone else seen a large degradation as a result of loop
> > interchange?
>
> I would suggest comparing the -fopt-info diagnostic output with and
> without -fno-loop-interchange; the interchanged loop might for example
> no longer vectorize.  Other than that - no, loop interchange isn't
> applied very often and it has a very conservative cost model.
>
> Are you able to share a testcase?
>
> Richard.
>
> > Thanks
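For reference, a minimal sketch of one way to interchange this nest at the
source level without the conditional-zeroing trick, by clearing each result
row up front; the types are the ones from the example above and the name
Multiply_interchanged is only illustrative:

void
Multiply_interchanged (matrix A, matrix B, long * __restrict Res)
{
  int Outer, Inner, Index;

  for (Outer = 0; Outer < UPPERLIMIT; Outer++)
    {
      /* Zero the whole result row first ...  */
      for (Inner = 0; Inner < UPPERLIMIT; Inner++)
        (*(matrix *)Res)[Outer][Inner] = 0;

      /* ... then accumulate with the Inner loop innermost, so B and
         Res are walked linearly.  */
      for (Index = 0; Index < UPPERLIMIT; Index++)
        for (Inner = 0; Inner < UPPERLIMIT; Inner++)
          (*(matrix *)Res)[Outer][Inner]
            += A[Outer][Index] * B[Index][Inner];
    }
}

With the Inner loop innermost, B[][] and Res[][] are accessed with linear
strides, which is the access pattern the interchange pass is trying to
create, but without the per-iteration conditional in the loop body.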