> This came up in a separate thread as well, but when doing reassoc of a
> chain with multiple dependent FMAs.
>
> I can't understand how this uarch detail can affect performance when, as
> in the testcase, the longest input latency is on the multiplication from
> a memory load.  Do we actually understand _why_ the FMAs are slower here?
This is my understanding: the loop is well predictable and the memory
address calculations + loads can happen in parallel, so the main
dependency chain is updating the accumulator computing c[i][j].  FMADD is
4 cycles on Zen4, while ADD is 3.  So the loop with FMADD cannot run any
faster than one iteration per 4 cycles, while with ADD it can do one
iteration per 3.  That roughly matches the speedup we see:
484875179*3/4 = 363656384, while the measured count is 375875209 cycles.
The benchmark is quite short and I run it 100 times in perf to collect
the data, so the overhead probably accounts for the smaller than expected
difference.

> Do we know that Cores can start the multiplication part when the add
> operand isn't ready yet?  I'm curious how you set up a micro benchmark
> to measure this.

Here is a cycle-counting benchmark:

#include <stdio.h>
int
main ()
{
  float o = 0;
  for (int i = 0; i < 1000000000; i++)
    {
#ifdef ACCUMULATE
      /* Dependency chain goes through the addend (the matrix
	 multiplication accumulation pattern).  */
      float p1 = o;
      float p2 = 0;
#else
      /* Dependency chain goes through one of the multiplicands.  */
      float p1 = 0;
      float p2 = o;
#endif
      float p3 = 0;
#ifdef FMA
      asm ("vfmadd231ss %2, %3, %0" : "=x" (o) : "0" (p1), "x" (p2), "x" (p3));
#else
      float t;
      asm ("mulss %2, %0" : "=x" (t) : "0" (p2), "x" (p3));
      asm ("addss %2, %0" : "=x" (o) : "0" (p1), "x" (t));
#endif
    }
  printf ("%f\n", o);
  return 0;
}

It performs FMAs in sequence, all with zeros.  If you define ACCUMULATE
you get the pattern from matrix multiplication.  On Zen I get:

jh@ryzen3:~> gcc -O3 -DFMA -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,001,011,489      cycles:u                  #    4.837 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 -DACCUMULATE l.c ; perf stat ./a.out 2>&1 | grep cycles:
     3,000,335,064      cycles:u                  #    4.835 GHz    (83.08%)

So 4 cycles per iteration for the FMA loop and 3 cycles for the separate
mul and add.  The muls execute in parallel with the adds in the second
case.  If the dependency chain goes through the multiplied parameter
instead, I get:

jh@ryzen3:~> gcc -O3 -DFMA l.c ; perf stat ./a.out 2>&1 | grep cycles:
     4,000,118,069      cycles:u                  #    4.836 GHz    (83.32%)
jh@ryzen3:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:
     6,001,947,341      cycles:u                  #    4.838 GHz    (83.32%)

FMA is the same (it is still one FMA instruction per iteration), while
mul+add is 6 cycles since the dependency chain is longer.

Core gives me:

jh@aster:~> gcc -O3 l.c -DFMA -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     5,001,515,473      cycles:u                  #    3.796 GHz
jh@aster:~> gcc -O3 l.c -DACCUMULATE ; perf stat ./a.out 2>&1 | grep cycles:u
     4,000,977,739      cycles:u                  #    3.819 GHz
jh@aster:~> gcc -O3 l.c -DFMA ; perf stat ./a.out 2>&1 | grep cycles:u
     5,350,523,047      cycles:u                  #    3.814 GHz
jh@aster:~> gcc -O3 l.c ; perf stat ./a.out 2>&1 | grep cycles:u
    10,251,994,240      cycles:u                  #    3.852 GHz

So FMA seems to be 5 cycles if we accumulate, and a bit more (beyond
noise) if we do the long chain.  I think some cores have a bigger
difference between these two numbers.  I am a bit surprised by the last
number of 10 cycles; I would expect 8.

I also changed the matrix multiplication benchmark to repeat the
multiplication 100 times; the numbers are below.

> There's one detail on Zen in that it can issue 2 FADDs and 2 FMUL/FMA
> per cycle.  So in theory we can at most do 2 FMA per cycle but with
> latency (FMA) == 4 for Zen3/4 and latency (FADD/FMUL) == 3 we might be
> able to squeeze out a little bit more throughput when there are many
> FADD/FMUL ops to execute?  That works independent on whether FMAs have
> a head-start on multiplication as you'd still be bottle-necked on the
> 2-wide issue for FMA?

I am not sure I follow what you say here.  The knob only checks for
FMADDs used in accumulation-type loops, so it is latency 4 versus
latency 3 per accumulation.  Indeed, in other loops FMADD is a win.
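To spell out what I mean by an accumulation-type loop, the two shapes of
the matmul inner loop compare roughly like this in C (just an
illustration with made-up function names; __builtin_fmaf stands in for
the fma instruction the compiler would emit):

/* Fused: the whole FMA sits on the loop-carried path through c, so
   every iteration pays the full FMA latency (4 cycles on Zen4).  */
float
acc_fma (const float *a, const float *b, int n)
{
  float c = 0.0f;
  for (int k = 0; k < n; k++)
    c = __builtin_fmaf (a[k], b[k], c);
  return c;
}

/* Split: the multiply does not depend on c, so it can start early and
   overlap with earlier iterations; only the add (3 cycles) stays on
   the loop-carried path.  */
float
acc_mul_add (const float *a, const float *b, int n)
{
  float c = 0.0f;
  for (int k = 0; k < n; k++)
    {
      float t = a[k] * b[k];
      c = c + t;
    }
  return c;
}

This is the same dependency structure the cycle-counting benchmark above
measures with the ACCUMULATE define.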
> On Icelake it seems all FADD/FMUL/FMA share ports 0 and 1 and all have
> a latency of four.  So you should get worse results there (looking at
> the numbers above you do get worse results, slightly so), probably the
> higher number of uops is hidden by the latency.

I think the slower non-FMA run on Core was just noise (it shows in the
overall time but not in the cycle counts).  I changed the benchmark to
run the multiplication 100 times.  On Intel I get:

jh@aster:~/gcc/build/gcc> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   15146405 clocks

 Performance counter stats for './a.out':

          15,149.62 msec task-clock:u          #    1.000 CPUs utilized
                  0      context-switches:u    #    0.000 /sec
                  0      cpu-migrations:u      #    0.000 /sec
                948      page-faults:u         #   62.576 /sec
     55,803,919,561      cycles:u              #    3.684 GHz
     87,615,590,411      instructions:u        #    1.57  insn per cycle
     12,512,896,307      branches:u            #  825.955 M/sec
         12,605,403      branch-misses:u       #    0.10% of all branches

       15.150064855 seconds time elapsed

       15.146817000 seconds user
        0.003333000 seconds sys

jh@aster:~/gcc/build/gcc> gcc matrix-fma.s ; perf stat ./a.out
 mult took   15308879 clocks

 Performance counter stats for './a.out':

          15,312.27 msec task-clock:u          #    1.000 CPUs utilized
                  1      context-switches:u    #    0.000 /sec
                  0      cpu-migrations:u      #    0.000 /sec
                948      page-faults:u         #   61.911 /sec
     59,449,535,152      cycles:u              #    3.882 GHz
     75,115,590,460      instructions:u        #    1.26  insn per cycle
     12,512,896,356      branches:u            #  817.181 M/sec
         12,605,235      branch-misses:u       #    0.10% of all branches

       15.312776274 seconds time elapsed

       15.309462000 seconds user
        0.003333000 seconds sys

The difference seems close to noise.  If I am counting right, with
100*1000*1000*1000 multiplications and a latency of 5, that should be
5*100*1000*1000*1000/8 = 62500000000 cycles overall.  Perhaps it runs a
bit faster because the chain is independent for every 125
multiplications.

On Zen I get:

jh@alberti:~> gcc matrix-nofma.s ; perf stat ./a.out
 mult took   10046353 clocks

 Performance counter stats for './a.out':

          10051.47 msec task-clock:u               #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
               940      page-faults:u              #   93.519 /sec
       36983540385      cycles:u                   #    3.679 GHz                     (83.34%)
           3535506      stalled-cycles-frontend:u  #    0.01% frontend cycles idle    (83.33%)
          12252917      stalled-cycles-backend:u   #    0.03% backend cycles idle     (83.34%)
       87650235892      instructions:u             #    2.37  insn per cycle
                                                   #    0.00  stalled cycles per insn (83.34%)
       12504689935      branches:u                 #    1.244 G/sec                   (83.33%)
          12606975      branch-misses:u            #    0.10% of all branches         (83.32%)

      10.059089949 seconds time elapsed

      10.048218000 seconds user
       0.003998000 seconds sys

jh@alberti:~> gcc matrix-fma.s ; perf stat ./a.out
 mult took   13147631 clocks

 Performance counter stats for './a.out':

          13152.81 msec task-clock:u               #    0.999 CPUs utilized
                 0      context-switches:u         #    0.000 /sec
                 0      cpu-migrations:u           #    0.000 /sec
               940      page-faults:u              #   71.468 /sec
       48394201333      cycles:u                   #    3.679 GHz                     (83.32%)
           4251637      stalled-cycles-frontend:u  #    0.01% frontend cycles idle    (83.32%)
          13664772      stalled-cycles-backend:u   #    0.03% backend cycles idle     (83.34%)
       75101376364      instructions:u             #    1.55  insn per cycle
                                                   #    0.00  stalled cycles per insn (83.35%)
       12510705466      branches:u                 #  951.182 M/sec                   (83.34%)
          12612898      branch-misses:u            #    0.10% of all branches         (83.33%)

      13.162186067 seconds time elapsed

      13.153354000 seconds user
       0.000000000 seconds sys

So here I would expect 3*100*1000*1000*1000/8 = 37500000000 cycles for
the first and 4*100*1000*1000*1000/8 = 50000000000 cycles for the second.
So again my estimate is a small over-statement, apparently due to
parallelism between the vector multiplications, but overall it seems to
match what I would expect to see.
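For reference, the back-of-envelope model behind the expected numbers
above is just latency times multiply-adds divided by the vector width,
assuming 8-float vectors and that the accumulator update is the only
bottleneck.  A throwaway helper to print the estimates:

#include <stdio.h>

int
main (void)
{
  double ops = 100.0 * 1000 * 1000 * 1000;  /* multiply-adds executed */
  double lanes = 8;                         /* floats per 256-bit vector */
  int latency[] = { 3, 4, 5 };              /* ADD on Zen, FMA on Zen, FMA on Core */
  for (int i = 0; i < 3; i++)
    printf ("latency %d: %.0f cycles expected\n",
            latency[i], latency[i] * ops / lanes);
  return 0;
}

This prints 37500000000, 50000000000 and 62500000000, i.e. the three
estimates used above.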
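If someone wants to try this on other cores without editing
x86-tune.def, the tunings should (if I remember the -mtune-ctrl syntax
right) be switchable on the command line, e.g. for the matrix benchmark
(called matrix.c here):

gcc -O3 -mavx2 -mfma matrix.c -mtune-ctrl=avoid_fma256_chains -o matrix-nofma
gcc -O3 -mavx2 -mfma matrix.c -mtune-ctrl=^avoid_fma256_chains -o matrix-fma

That is an easier way to produce the two variants compared above than
patching the tune flags.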
Honza

> > Since this seems noticeable win on zen and not loss on Core it seems
> > like good default for generic.
> >
> > I plan to commit the patch next week if there are no complaints.
>
> complaint!
>
> Richard.
>
> > Honza
> >
> > #include <stdio.h>
> > #include <time.h>
> >
> > #define SIZE 1000
> >
> > float a[SIZE][SIZE];
> > float b[SIZE][SIZE];
> > float c[SIZE][SIZE];
> >
> > void init(void)
> > {
> >    int i, j, k;
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          a[i][j] = (float)i + j;
> >          b[i][j] = (float)i - j;
> >          c[i][j] = 0.0f;
> >       }
> >    }
> > }
> >
> > void mult(void)
> > {
> >    int i, j, k;
> >
> >    for(i=0; i<SIZE; ++i)
> >    {
> >       for(j=0; j<SIZE; ++j)
> >       {
> >          for(k=0; k<SIZE; ++k)
> >          {
> >             c[i][j] += a[i][k] * b[k][j];
> >          }
> >       }
> >    }
> > }
> >
> > int main(void)
> > {
> >    clock_t s, e;
> >
> >    init();
> >    s=clock();
> >    mult();
> >    e=clock();
> >    printf(" mult took %10d clocks\n", (int)(e-s));
> >
> >    return 0;
> > }
> >
> > 	* config/i386/x86-tune.def (X86_TUNE_AVOID_128FMA_CHAINS,
> > 	X86_TUNE_AVOID_256FMA_CHAINS): Enable for znver4 and Core.
> >
> > diff --git a/gcc/config/i386/x86-tune.def b/gcc/config/i386/x86-tune.def
> > index 43fa9e8fd6d..74b03cbcc60 100644
> > --- a/gcc/config/i386/x86-tune.def
> > +++ b/gcc/config/i386/x86-tune.def
> > @@ -515,13 +515,13 @@ DEF_TUNE (X86_TUNE_USE_SCATTER_8PARTS, "use_scatter_8parts",
> >
> >  /* X86_TUNE_AVOID_128FMA_CHAINS: Avoid creating loops with tight 128bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3
> > -	  | m_YONGFENG)
> > +DEF_TUNE (X86_TUNE_AVOID_128FMA_CHAINS, "avoid_fma_chains", m_ZNVER1 | m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_YONGFENG | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_256FMA_CHAINS: Avoid creating loops with tight 256bit or
> >     smaller FMA chain.  */
> > -DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3
> > -	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM)
> > +DEF_TUNE (X86_TUNE_AVOID_256FMA_CHAINS, "avoid_fma256_chains", m_ZNVER2 | m_ZNVER3 | m_ZNVER4
> > +	  | m_CORE_HYBRID | m_SAPPHIRERAPIDS | m_CORE_ATOM | m_GENERIC)
> >
> >  /* X86_TUNE_AVOID_512FMA_CHAINS: Avoid creating loops with tight 512bit or
> >     smaller FMA chain.  */