https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90283
--- Comment #4 from rsandifo at gcc dot gnu.org <rsandifo at gcc dot gnu.org> ---
(In reply to Martin Liška from comment #3)
> The perf comes from an Intel Skylake server machine.
>
> The number of fma is very similar:
> grep fma bad.report.txt | wc -l
> 126
> grep fma good.report.txt | wc -l
> 128

Grepping for vfm also includes the vfmsubs etc., with the same gap:

bad.report.txt:167
good.report.txt:169

The distribution also looks similar:

$ sed -n 's/.*\(vfm[^ ]*\).*/\1/p' good.report.txt | sort | uniq -c
     61 vfmadd132sd
      1 vfmadd132ss
     35 vfmadd213sd
     30 vfmadd231sd
      1 vfmadd231ss
     32 vfmsub132sd
      1 vfmsub213sd
      8 vfmsub231sd
$ sed -n 's/.*\(vfm[^ ]*\).*/\1/p' bad.report.txt | sort | uniq -c
     60 vfmadd132sd
      1 vfmadd132ss
     35 vfmadd213sd
     29 vfmadd231sd
      1 vfmadd231ss
     29 vfmsub132sd
      1 vfmsub213sd
     11 vfmsub231sd

> But the assembly is shuffled quite significantly after the change. Can you
> Richard Sandiford please take a look?

I think I'm going to need more clues why the new code is so much slower in
practice.  Could someone more familiar with the architecture comment?
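
For anyone who wants the per-mnemonic difference directly rather than comparing the two uniq -c listings by eye, here is a rough Python sketch. It is only an illustration: the helper names count_mnemonics and diff_counts are made up for this example, the regex is meant to mirror the sed expression above (last "vfm" token on each line), and the GOOD/BAD tables are simply the counts quoted in this comment.

```python
import re
from collections import Counter

# Mirrors sed 's/.*\(vfm[^ ]*\).*/\1/p': the greedy ".*" picks the last
# "vfm..." token on a line, and the token ends at the next space.
VFM_RE = re.compile(r".*(vfm[^ ]*)")


def count_mnemonics(text):
    """Count vfm* mnemonics, at most one per line (hypothetical helper)."""
    counts = Counter()
    for line in text.splitlines():
        m = VFM_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


def diff_counts(good, bad):
    """Return {mnemonic: bad - good} for mnemonics whose counts differ."""
    names = sorted(set(good) | set(bad))
    return {n: bad[n] - good[n] for n in names if bad[n] != good[n]}


# Counts copied from the uniq -c output quoted in this comment.
GOOD = Counter({"vfmadd132sd": 61, "vfmadd132ss": 1, "vfmadd213sd": 35,
                "vfmadd231sd": 30, "vfmadd231ss": 1, "vfmsub132sd": 32,
                "vfmsub213sd": 1, "vfmsub231sd": 8})
BAD = Counter({"vfmadd132sd": 60, "vfmadd132ss": 1, "vfmadd213sd": 35,
               "vfmadd231sd": 29, "vfmadd231ss": 1, "vfmsub132sd": 29,
               "vfmsub213sd": 1, "vfmsub231sd": 11})

if __name__ == "__main__":
    for name, delta in diff_counts(GOOD, BAD).items():
        print(f"{name:14s} {delta:+d}")
```

On the numbers above this reports vfmadd132sd -1, vfmadd231sd -1, vfmsub132sd -3 and vfmsub231sd +3, i.e. a net difference of two instructions. To run it on the actual reports, pass the contents of good.report.txt and bad.report.txt to count_mnemonics instead of using the hard-coded tables.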