https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90283
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #3)
> The perf comes from an Intel Skylake server machine.
>
> The number of fma is very similar:
> grep fma bad.report.txt | wc -l
> 126
> grep fma good.report.txt | wc -l
> 128
Grepping for vfm instead also picks up the vfmsub variants, and shows the
same gap of two:
bad.report.txt:167
good.report.txt:169
The distribution also looks similar:
$ sed -n 's/.*\(vfm[^ ]*\).*/\1/p' good.report.txt | sort | uniq -c
61 vfmadd132sd
1 vfmadd132ss
35 vfmadd213sd
30 vfmadd231sd
1 vfmadd231ss
32 vfmsub132sd
1 vfmsub213sd
8 vfmsub231sd
$ sed -n 's/.*\(vfm[^ ]*\).*/\1/p' bad.report.txt | sort | uniq -c
60 vfmadd132sd
1 vfmadd132ss
35 vfmadd213sd
29 vfmadd231sd
1 vfmadd231ss
29 vfmsub132sd
1 vfmsub213sd
11 vfmsub231sd
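To make the per-mnemonic differences easier to see, here is a minimal Python
sketch that mirrors the sed pipeline above and prints only the mnemonics whose
counts differ. The helper names (mnemonic_counts, count_deltas) are
hypothetical, and the counts are copied from the two listings above rather
than re-read from the report files:

```python
import re
from collections import Counter

# The greedy leading ".*" mirrors the sed expression above, so the last
# "vfm..." token on each line is the one that gets captured.
VFM_RE = re.compile(r".*(vfm[^ ]*)")

def mnemonic_counts(lines):
    """Count vfm* mnemonics in an iterable of report lines."""
    counts = Counter()
    for line in lines:
        match = VFM_RE.search(line.rstrip("\n"))
        if match:
            counts[match.group(1)] += 1
    return counts

def count_deltas(good, bad):
    """Return {mnemonic: bad - good} for mnemonics whose counts differ."""
    return {name: bad[name] - good[name]
            for name in sorted(set(good) | set(bad))
            if bad[name] != good[name]}

# Counts copied from the two listings above.
good = Counter({"vfmadd132sd": 61, "vfmadd132ss": 1, "vfmadd213sd": 35,
                "vfmadd231sd": 30, "vfmadd231ss": 1, "vfmsub132sd": 32,
                "vfmsub213sd": 1, "vfmsub231sd": 8})
bad = Counter({"vfmadd132sd": 60, "vfmadd132ss": 1, "vfmadd213sd": 35,
               "vfmadd231sd": 29, "vfmadd231ss": 1, "vfmsub132sd": 29,
               "vfmsub213sd": 1, "vfmsub231sd": 11})

print(count_deltas(good, bad))
# {'vfmadd132sd': -1, 'vfmadd231sd': -1, 'vfmsub132sd': -3, 'vfmsub231sd': 3}
```

To run it on the reports directly, mnemonic_counts(open("good.report.txt"))
should give the same counts as the sed pipeline, assuming the reports separate
fields with spaces.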
> But the assembly is shuffled quite significantly after the change. Can you
> Richard Sandiford please take a look?
I think I'm going to need more clues as to why the new code is so much
slower in practice. Could someone more familiar with the architecture
comment?