[Bug c/103850] New: missed optimization in AVX code

martin--- via Gcc-bugs Tue, 28 Dec 2021 02:10:10 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850


            Bug ID: 103850
           Summary: missed optimization in AVX code
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: [email protected]
  Target Milestone: ---

Created attachment 52076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is
responsible for this, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing spherical
harmonic transforms. Apparently it can be compiled in a way that gives close to
theoretical peak performance, at least on my hardware (Zen 2), but this only
happens if the statements in the inner loop are arranged in a specific way.
Trivial rearrangements result in performance that is about 30% lower.
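
To illustrate what is meant by a "trivial rearrangement", here is a minimal
sketch. It is NOT the attached test case: the function names, the recurrence
and all constants are hypothetical and were chosen only for illustration. Both
variants execute the same statements per iteration and differ only in their
order, so they should produce the same results up to rounding, and one might
expect the compiler to generate equally fast code for both.

```c
#include <stddef.h>

/* Hypothetical stand-in for the real kernel: one step of a three-term
   recurrence over N independent lanes, accumulating two weighted sums.
   Nothing here is taken from attachment 52076. */
enum { N = 8 };

/* Variant A: update both accumulators first, then advance the recurrence. */
void kernel_a(double *restrict acc1, double *restrict acc2,
              double *restrict l1, double *restrict l2,
              const double *restrict x,
              double a, double b, double c1, double c2)
{
    for (size_t i = 0; i < N; ++i) {
        acc1[i] += c1 * l2[i];
        acc2[i] += c2 * l2[i];
        double next = (a * x[i] + b) * l2[i] - l1[i];
        l1[i] = l2[i];
        l2[i] = next;
    }
}

/* Variant B: same statements, but the recurrence is advanced first and
   the accumulators are updated afterwards from the saved value. */
void kernel_b(double *restrict acc1, double *restrict acc2,
              double *restrict l1, double *restrict l2,
              const double *restrict x,
              double a, double b, double c1, double c2)
{
    for (size_t i = 0; i < N; ++i) {
        double cur = l2[i];
        double next = (a * x[i] + b) * cur - l1[i];
        l1[i] = cur;
        l2[i] = next;
        acc1[i] += c1 * cur;
        acc2[i] += c2 * cur;
    }
}

/* Runs both variants on identical input and returns 1 if the accumulated
   results agree to within a small tolerance, 0 otherwise. */
int kernels_agree(void)
{
    double x[N], a1[N] = {0}, a2[N] = {0}, p1[N], p2[N];
    double b1[N] = {0}, b2[N] = {0}, q1[N], q2[N];

    for (size_t i = 0; i < N; ++i) {
        x[i] = 0.1 * (double)i;
        p1[i] = q1[i] = 1.0;
        p2[i] = q2[i] = x[i];
    }
    for (int step = 0; step < 10; ++step) {
        kernel_a(a1, a2, p1, p2, x, 0.5, 0.25, 2.0, 3.0);
        kernel_b(b1, b2, q1, q2, x, 0.5, 0.25, 2.0, 3.0);
    }
    for (size_t i = 0; i < N; ++i) {
        double d1 = a1[i] - b1[i];
        double d2 = a2[i] - b2[i];
        if (d1 < 0) d1 = -d1;
        if (d2 < 0) d2 = -d2;
        if (d1 > 1e-12 || d2 > 1e-12)
            return 0;
    }
    return 1;
}
```

Whether this particular pair shows any timing difference is untested; it only
demonstrates the kind of source-level reordering in question. Comparing the
assembly generated for the two variants (for example with "-S") would be one
way to see whether the compiler treats them identically.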

I would have expected gcc to be able to spot this kind of rearrangement and
do it by itself, but this doesn't seem to be the case at the moment. If that
could be fixed, that would obviously be great, but if not, I'd be grateful for
any tips on how the most "efficient" arrangements can be found for such
critical loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case.
On my machine the code reports

slow kernel version: 45.317578 GFlops/s
fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall"

Clang and Intel icx show the same discrepancy, so it seems that the required
re-ordering is indeed hard to do.
