https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103850
Bug ID: 103850
Summary: missed optimization in AVX code
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
Assignee: unassigned at gcc dot gnu.org
Reporter: mar...@mpa-garching.mpg.de
Target Milestone: ---

Created attachment 52076
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52076&action=edit
test case

(I'm reporting this under "C" because I don't know which optimizer is responsible for this, but I observe the same behaviour in C++ programs as well.)

This test case was distilled from a hot loop in a library computing spherical harmonic transforms. Apparently it can be compiled in a way that gives close to theoretical peak performance, at least on my hardware (Zen 2), but this only happens if the statements in the inner loop are arranged in a specific way. Trivial rearrangements result in performance that is about 30% lower.

I would have expected that gcc would be able to spot this kind of rearrangement and do it by itself, but this doesn't seem to be the case at the moment. If that could be fixed, that would obviously be great, but if not, I'd be grateful for any tips on how the most "efficient" arrangements can be found for such critical loops without resorting to trial and error.

The loops in question start at lines 27 and 78 in the attached test case.

On my machine the code reports

    slow kernel version: 45.317578 GFlops/s
    fast kernel version: 67.083952 GFlops/s

when compiled with "-O3 -march=znver2 -ffast-math -W -Wall".

Clang and Intel icx show the same discrepancy, so it seems that the required re-ordering is indeed hard to do.
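
For readers without access to the attachment, the following is a minimal sketch of what a "trivial rearrangement" of inner-loop statements can look like. It is NOT the kernel from attachment 52076: the recurrence, the struct `state_t`, and the functions `kernel_order_a`, `kernel_order_b` and `kernels_agree` are all hypothetical and exist only for illustration. The two kernels differ only in the order of statements that do not depend on each other, so they compute the same values. The sketch makes no claim about which ordering any compiler handles better.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical two-term recurrence with two accumulators.
   Illustrative only; this is not the code from the attached test case. */
typedef struct { double l1, l2, acc1, acc2; } state_t;

/* Ordering A: each accumulation comes before the recurrence update. */
static state_t kernel_order_a(size_t n, const double *a, const double *b,
                              const double *c, double x)
{
  state_t s = { 1.0, 0.5, 0.0, 0.0 };
  for (size_t i = 0; i < n; ++i) {
    s.acc1 += c[i] * s.l2;
    s.l1 = (a[i] * x + b[i]) * s.l2 + s.l1;
    s.acc2 += c[i] * s.l1;
    s.l2 = (a[i] * x + b[i]) * s.l1 + s.l2;
  }
  return s;
}

/* Ordering B: same statements, but each independent pair is swapped. */
static state_t kernel_order_b(size_t n, const double *a, const double *b,
                              const double *c, double x)
{
  state_t s = { 1.0, 0.5, 0.0, 0.0 };
  for (size_t i = 0; i < n; ++i) {
    s.l1 = (a[i] * x + b[i]) * s.l2 + s.l1;
    s.acc1 += c[i] * s.l2;   /* l2 was not modified by the line above */
    s.l2 = (a[i] * x + b[i]) * s.l1 + s.l2;
    s.acc2 += c[i] * s.l1;   /* l1 was not modified by the line above */
  }
  return s;
}

/* Absolute difference without needing libm. */
static double abs_diff(double u, double v) { return u > v ? u - v : v - u; }

/* Returns 1 if both orderings agree (within a small tolerance) on
   made-up coefficients, 0 otherwise. */
int kernels_agree(size_t n)
{
  enum { MAX_N = 256 };
  double a[MAX_N], b[MAX_N], c[MAX_N];
  assert(n <= MAX_N);
  for (size_t i = 0; i < n; ++i) {
    a[i] = 0.01 / (double)(i + 1);
    b[i] = -0.005;
    c[i] = 1.0 / (double)(i + 1);
  }
  state_t p = kernel_order_a(n, a, b, c, 0.3);
  state_t q = kernel_order_b(n, a, b, c, 0.3);
  const double tol = 1e-12;
  return abs_diff(p.l1, q.l1) <= tol && abs_diff(p.l2, q.l2) <= tol &&
         abs_diff(p.acc1, q.acc1) <= tol && abs_diff(p.acc2, q.acc2) <= tol;
}
```

Compiling both functions with the flags quoted above and comparing the assembly (for example via gcc's -S option) is one way to check whether a given compiler treats two such orderings differently.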