https://gcc.gnu.org/bugzilla/show_bug.cgi?id=97832

--- Comment #19 from Michael_S <already5chosen at yahoo dot com> ---
(In reply to Alexander Monakov from comment #18)
> The apparent 'bias' is introduced by instruction scheduling: haifa-sched
> lifts a +64 increment over memory accesses, transforming +0 and +32
> displacements to -64 and -32. Sometimes this helps a little bit even on
> modern x86 CPUs.

I don't think it ever helps on Intel Sandy Bridge or later, or on AMD Zen1 or
later.

> 
> Also note that 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' would be
> 'unlaminated' (turned to 2 uops before renaming), so selecting independent
> IVs for the two arrays actually helps on this testcase.

Both 'vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0' and
'vfnmadd231pd 32(%rdx), %ymm3, %ymm0' would be turned into 2 uops.

Misuse of load+op is a far bigger problem in this particular test case than the
sub-optimal loop overhead. Assuming execution on Intel Skylake, it turns a loop
that could potentially run at 3 clocks per iteration into one that takes 4+
clocks per iteration.
But I consider it a separate issue. I reported a similar issue in 97127, but
here it is more serious. It looks to me like the issue is not solvable within
the existing gcc optimization framework. The only chance is if you accept my
old and simple advice: within inner loops, pretend that AVX is a RISC ISA,
i.e. generate code as if the load-op forms of AVX instructions did not exist.
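
To illustrate the kind of codegen I mean (a minimal sketch only; the function
name, data layout and trip-count handling are my own assumptions, not the
testcase attached to this PR), here is an FMA inner loop written with
intrinsics, annotated with the two instruction-selection choices:

#include <immintrin.h>

/* Hypothetical kernel: acc[0..3] -= a[i]*b[i], 4 doubles per iteration.
   n is assumed to be a multiple of 4.                                   */
void fnma_kernel(double *restrict acc, const double *restrict a,
                 const double *restrict b, long n)
{
    __m256d vacc = _mm256_loadu_pd(acc);
    for (long i = 0; i < n; i += 4) {
        /* "AVX as RISC" codegen suggested above: emit an explicit
           vmovupd load, then a register-register vfnmadd231pd, so the
           FMA uop never carries a memory operand into the renamer.     */
        __m256d va = _mm256_loadu_pd(a + i);
        __m256d vb = _mm256_loadu_pd(b + i);
        vacc = _mm256_fnmadd_pd(va, vb, vacc);   /* vacc = -(va*vb)+vacc */
        /* The load-op form criticized above corresponds to the compiler
           folding one of these loads into the FMA instead, e.g.
           vfnmadd231pd 32(%rdx,%rax), %ymm3, %ymm0                      */
    }
    _mm256_storeu_pd(acc, vacc);
}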
