https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832
--- Comment #21 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
(In reply to Michael_S from comment #19)
> > Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> > 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> > IVs for the two arrays actually helps on this testcase.
> 
> Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and 'vfnmadd231pd 32(%rdx),
> %ymm3, %ymm0' would be turned into 2 uops.

The difference is at which point in the pipeline: the latter goes through
renaming as one fused uop.

> Misuse of load+op is far bigger problem in this particular test case than
> sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns
> loop that can potentially run at 3 clocks per iteration into loop of 4+
> clocks per iteration.

Sorry, which assembler output does this refer to?

> But I consider it a separate issue. I reported similar issue in 97127, but
> here it is more serious. It looks to me that the issue is not soluble within
> existing gcc optimization framework. The only chance is if you accept my old
> and simple advice - within inner loops pretend that AVX is RISC, i.e.
> generate code as if load-op form of AVX instructions weren't existing.

In bug 97127, the best explanation we have so far is that we do not optimally
handle the case where the non-memory inputs of an fma are reused, so we cannot
combine a load with an fma without causing an extra register copy (PR 97127
comment 16 demonstrates what I mean). I cannot imagine such trouble arising
with more common commutative operations like mul/add, especially with the
non-destructive VEX encoding. If you hit such examples, I would suggest
reporting them as well, because their root cause might be different.

In general, load-op combining should be very helpful on x86: it reduces the
number of uops flowing through the renaming stage, which is one of the
narrowest points in the pipeline.
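
To make the unlamination point concrete, here is a minimal sketch of the two
inner-loop shapes. This is not the actual compiler output for the testcase;
registers, labels and the reduction form are made up for illustration, and it
assumes a reduction over two double arrays:

# shared index register: the fma memory operand is base+index, so on
# Skylake the load-op uop is unlaminated (split into 2 uops) before rename
.L_shared_index:
        vmovupd      (%rsi,%rax), %ymm1          # load x[i..i+3]
        vfnmadd231pd (%rdx,%rax), %ymm1, %ymm0   # acc -= x*y, indexed operand
        addq         $32, %rax
        cmpq         %rcx, %rax
        jne          .L_shared_index

# independent IVs (one pointer per array): the fma memory operand is
# base only, so the load stays micro-fused with the fma through rename,
# at the cost of one extra pointer increment
.L_independent_ivs:
        vmovupd      (%rsi), %ymm1               # load x[i..i+3]
        vfnmadd231pd (%rdx), %ymm1, %ymm0        # acc -= x*y, stays one fused uop
        addq         $32, %rsi
        addq         $32, %rdx
        cmpq         %rcx, %rdx
        jne          .L_independent_ivs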
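
On the fma/register-copy point: FMA3 is destructive, the destination is always
one of the three register sources, and a folded memory operand can never be
the overwritten operand. So when both register inputs are still live after the
fma, folding the load forces a copy, whereas a separate load gives the fma a
dead register to overwrite. A rough sketch of the kind of situation I mean
(made-up registers, not the exact code from PR 97127 comment 16):

# want r = x*y + z, where y comes from memory and both x (%ymm1) and
# z (%ymm2) are reused later, so neither may be clobbered

# load folded into the fma: the destination must be one of the live
# registers, so a copy is needed to preserve z
        vmovapd      %ymm2, %ymm4               # copy z
        vfmadd231pd  (%rdi), %ymm1, %ymm4       # ymm4 = x*y + z

# separate load: the freshly loaded y is dead after the fma, so it can
# be the overwritten operand and no extra copy is needed
        vmovapd      (%rdi), %ymm4              # y
        vfmadd213pd  %ymm2, %ymm1, %ymm4        # ymm4 = x*y + z

Both variants are 2 uops, i.e. folding the load buys nothing here because the
copy cancels the saved uop; with a plain VEX mul/add the non-destructive
3-operand form avoids the copy, which is why the fma case is special.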