> 
> This came up in a separate thread as well, but when doing reassoc of a
> chain with
> multiple dependent FMAs.
> 
> I can't understand how this uarch detail can affect performance when
> as in the testcase
> the longest input latency is on the multiplication from a memory load.
> Do we actually
> understand _why_ the FMAs are slower here?

This is my understanding:
The loop is well predictable and the memory address calculations + loads can
happen in parallel.  So the main dependency chain is updating the accumulator
computing c[i][j].  FMADD is 4 cycles on Zen4, while ADD is 3.  So the
loop with FMADD cannot run any faster than one iteration per 4 cycles,
while with ADD it can do one iteration per 3.  Which roughly matches the
speedup we see: 484875179*3/4=363656384, while the measured speed is 375875209
cycles.  The benchmark is quite short and I run it 100 times in perf to
collect the data, so the overhead probably contributes to the smaller than
expected difference.
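
To make the two dependence shapes concrete, here is a hand-written sketch
(the dot_fma/dot_split helpers are made up for illustration and are not code
from the benchmark or from GCC; build with -ffp-contract=off so the compiler
keeps the second variant as a separate mul and add):

/* Reduction through FMA: the accumulator feeds the 4-cycle FMA on Zen3/4.  */
float
dot_fma (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    acc = __builtin_fmaf (a[k], b[k], acc);
  return acc;
}

/* Reduction through separate mul and add: the multiply does not depend on
   acc and overlaps with the previous add, so only the 3-cycle add is on the
   loop-carried path.  */
float
dot_split (const float *a, const float *b, int n)
{
  float acc = 0.0f;
  for (int k = 0; k < n; k++)
    {
      float t = a[k] * b[k];
      acc = acc + t;
    }
  return acc;
}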

> 
> Do we know that Cores can start the multiplication part when the add
> operand isn't
> ready yet?  I'm curious how you set up a micro benchmark to measure this.

Here is a cycle-counting benchmark:
#include <stdio.h>
int
main()
{
  float o=0;
  for (int i = 0; i < 1000000000; i++)
  {
#ifdef ACCUMULATE
    /* Accumulation pattern: the loop-carried value o enters as the addend.  */
    float p1 = o;
    float p2 = 0;
#else
    /* Chain the dependency through the multiplied operand instead.  */
    float p1 = 0;
    float p2 = o;
#endif
    float p3 = 0;
#ifdef FMA
    /* o = p2 * p3 + p1 as a single FMA.  */
    asm ("vfmadd231ss %2, %3, %0":"=x"(o):"0"(p1),"x"(p2),"x"(p3));
#else
    /* The same computation as a separate multiply and add.  */
    float t;
    asm ("mulss %2, %0":"=x"(t):"0"(p2),"x"(p3));
    asm ("addss %2, %0":"=x"(o):"0"(p1),"x"(t));
#endif
  }
  printf ("%f\n",o);
  return 0;
}

It performs FMAs in sequence, all with zeros.  If you define ACCUMULATE
you get the accumulation pattern from the matrix multiplication. On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,001,011,489      cycles:u                         #    4.837 GHz          (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     3,000,335,064      cycles:u                         #    4.835 GHz          (83.08%)

So 4 cycles for the FMA loop and 3 cycles for the separate mul and add.
Muls execute in parallel with adds in the second case.
If the dependency chain goes through the multiplied parameter I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,000,118,069      cycles:u                         #    4.836 GHz          (83.32%)
jh@ryzen3:~> gcc -O3  l.c ; perf stat ./a.out 2>&1 | grep cycles:
     6,001,947,341      cycles:u                         #    4.838 GHz          (83.32%)

FMA is the same (it is still one FMA instruction per iteration), while
mul+add is 6 cycles since the dependency chain is longer.
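
In other words (just restating the latencies above):

  FMA, either pattern:              4 cycles/iteration -> ~4.0e9 cycles for 1e9 iterations
  mul+add, ACCUMULATE:              only the 3-cycle add is loop carried -> ~3.0e9 cycles
  mul+add, chain through multiply:  3 + 3 = 6 cycles/iteration -> ~6.0e9 cycles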

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     5,001,515,473      cycles:u                         #    3.796 GHz
jh@aster:~> gcc -O3 l.c  -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     4,000,977,739      cycles:u                         #    3.819 GHz
jh@aster:~> gcc -O3 l.c  -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
     5,350,523,047      cycles:u                         #    3.814 GHz
jh@aster:~> gcc -O3 l.c   ; perf stat ./a.out 2>&1 | grep cycles:u
    10,251,994,240      cycles:u                         #    3.852 GHz

So FMA seems to be 5 cycles if we accumulate and a bit more (beyond noise) if
we do the long chain.  I think some cores have a bigger difference between
these two numbers.
I am a bit surprised by the last number of 10 cycles; I would expect 8
(a 4-cycle multiply followed by a dependent 4-cycle add).

> 
> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA per 
> cycle.
> So in theory we can at most do 2 FMA per cycle but with latency (FMA)
> == 4 for Zen3/4
> and latency (FADD/FMUL) == 3 we might be able to squeeze out a little bit more
> throughput when there are many FADD/FMUL ops to execute?  That works 
> independent
> on whether FMAs have a head-start on multiplication as you'd still be
> bottle-necked
> on the 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in accumulation-type loops, so it is latency 4 vs latency 3
per accumulation.  Indeed, in other loops FMADD is a win.
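
What the tuning effectively chooses between in the vectorized reduction could
be sketched like this (an illustrative intrinsics sketch, not actual GCC
output):

#include <immintrin.h>

/* With FMA chains allowed: the accumulator feeds a 4-cycle FMA on Zen3/4.  */
static inline __m256
step_fma (__m256 acc, __m256 va, __m256 vb)
{
  return _mm256_fmadd_ps (va, vb, acc);
}

/* With X86_TUNE_AVOID_256FMA_CHAINS: the multiply does not depend on the
   accumulator and overlaps with it; only the 3-cycle add is loop carried.  */
static inline __m256
step_split (__m256 acc, __m256 va, __m256 vb)
{
  return _mm256_add_ps (acc, _mm256_mul_ps (va, vb));
}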
> 
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have a 
> latency
> of four.  So you should get worse results there (looking at the
> numbers above you
> do get worse results, slightly so), probably the higher number of uops is 
> hidden
> by the latency.
I think the slower non-FMA result on Core was just noise (it shows in the
overall time but not in the cycle counts).

I changed the benchmark to run the multiplication 100 times.
On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
        mult took   15146405 clocks

 Performance counter stats for './a.out':

         15,149.62 msec task-clock:u                     #    1.000 CPUs utilized
                  0      context-switches:u               #    0.000 /sec
                  0      cpu-migrations:u                 #    0.000 /sec
                948      page-faults:u                    #   62.576 /sec
     55,803,919,561      cycles:u                         #    3.684 GHz
     87,615,590,411      instructions:u                   #    1.57  insn per cycle
     12,512,896,307      branches:u                       #  825.955 M/sec
         12,605,403      branch-misses:u                  #    0.10% of all branches

      15.150064855 seconds time elapsed

      15.146817000 seconds user
       0.003333000 seconds sys


jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
        mult took   15308879 clocks

 Performance counter stats for './a.out':

         15,312.27 msec task-clock:u                     #    1.000 CPUs utilized
                  1      context-switches:u               #    0.000 /sec
                  0      cpu-migrations:u                 #    0.000 /sec
                948      page-faults:u                    #   61.911 /sec
     59,449,535,152      cycles:u                         #    3.882 GHz
     75,115,590,460      instructions:u                   #    1.26  insn per cycle
     12,512,896,356      branches:u                       #  817.181 M/sec
         12,605,235      branch-misses:u                  #    0.10% of all branches

      15.312776274 seconds time elapsed

      15.309462000 seconds user
       0.003333000 seconds sys

The difference seems close to noise.
If I am counting right, with 100*1000*1000*1000 multiplications done 8 floats
per vector, the 5-cycle FMA latency gives 5*100*1000*1000*1000/8 = 62,500,000,000
cycles overall.  Perhaps it runs a bit faster because the chain is independent
for every 125 multiplications.
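
Side by side (same assumption of 8 floats per vector and the accumulator as
the only loop-carried dependence):

  expected with 5-cycle FMA:  5*100*1000*1000*1000/8 = 62,500,000,000 cycles
  measured, FMA:              59,449,535,152 cycles
  measured, mul+add:          55,803,919,561 cycles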

jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
        mult took   10046353 clocks

 Performance counter stats for './a.out':

          10051.47 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               940      page-faults:u                    #   93.519 /sec
       36983540385      cycles:u                         #    3.679 GHz                      (83.34%)
           3535506      stalled-cycles-frontend:u        #    0.01% frontend cycles idle     (83.33%)
          12252917      stalled-cycles-backend:u         #    0.03% backend cycles idle      (83.34%)
       87650235892      instructions:u                   #    2.37  insn per cycle
                                                         #    0.00  stalled cycles per insn  (83.34%)
       12504689935      branches:u                       #    1.244 G/sec                    (83.33%)
          12606975      branch-misses:u                  #    0.10% of all branches          (83.32%)

      10.059089949 seconds time elapsed

      10.048218000 seconds user
       0.003998000 seconds sys


jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
        mult took   13147631 clocks

 Performance counter stats for './a.out':

          13152.81 msec task-clock:u                     #    0.999 CPUs utilized
                 0      context-switches:u               #    0.000 /sec
                 0      cpu-migrations:u                 #    0.000 /sec
               940      page-faults:u                    #   71.468 /sec
       48394201333      cycles:u                         #    3.679 GHz                      (83.32%)
           4251637      stalled-cycles-frontend:u        #    0.01% frontend cycles idle     (83.32%)
          13664772      stalled-cycles-backend:u         #    0.03% backend cycles idle      (83.34%)
       75101376364      instructions:u                   #    1.55  insn per cycle
                                                         #    0.00  stalled cycles per insn  (83.35%)
       12510705466      branches:u                       #  951.182 M/sec                    (83.34%)
          12612898      branch-misses:u                  #    0.10% of all branches          (83.33%)

      13.162186067 seconds time elapsed

      13.153354000 seconds user
       0.000000000 seconds sys

So here I would expect 3*100*1000*1000*1000/8 = 37,500,000,000 cycles for the
first and 4*100*1000*1000*1000/8 = 50,000,000,000 cycles for the second.
So again a small over-estimate, apparently due to parallelism between the
vector multiplications, but overall it seems to match what I would expect
to see.
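
The same comparison for the Zen numbers:

  mul+add:  expected 3*100*1000*1000*1000/8 = 37,500,000,000 cycles, measured 36,983,540,385
  FMA:      expected 4*100*1000*1000*1000/8 = 50,000,000,000 cycles, measured 48,394,201,333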

Honza

> 
> > Since this seems noticeable win on zen and not loss on Core it seems like 
> > good
> > default for generic.
> >
> > I plan to commit the patch next week if there are no compplains.
> 
> complaint!
> 
> Richard.
> 
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          for(k=0; k<SIZE; ++k)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf("        mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> >
> > }
> >
> >         * confg/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS, 
> > X86_TUNE_AVOID_256FMA_CHAINS)
> >         Enable for znver4 and Core.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, 
> > "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > m_ZNVER2 | m_ZNVER3
> > -          | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | 
> > m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +          | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> > m_ZNVER3
> > -         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | 
> > m_ZNVER3 | m_ZNVER4
> > +         | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain.  */
