https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568

            Bug ID: 80568
           Summary: x86 -mavx256-split-unaligned-load (and store) is
                    affecting AVX2 code, but probably shouldn't be.
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 41285
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41285&action=edit
bswap16.cc

gcc7 (or at least the gcc8 snapshot on https://godbolt.org/g/ZafCE0) now splits
unaligned loads/stores even for AVX2 integer vectors, where gcc6.3 didn't.

I think this is undesirable by default, because some projects probably build
with -mavx2 but fail to use -mtune=haswell (or broadwell or skylake).  For now,
Intel CPUs that do well with 32B unaligned loads are probably the most common
AVX2-supporting CPUs.

I don't know what's optimal for Excavator or Zen.  Was this an intentional
change to make the corresponding -mtune options work better on those CPUs?

I would suggest that -mavx2 should imply -mno-avx256-split-unaligned-load (and
-store) for -mtune=generic.  Or if that's too ugly (having instruction-set
selection affect tuning), then maybe just revert to the previous behaviour of
not splitting integer loads/stores the way FP loads/stores are split.
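
In the meantime, projects that know their data is usually aligned at run-time
can presumably get the single 32B accesses back by disabling the split
explicitly (these are just the documented negative forms of the tune options;
treat the exact command as a sketch):

g++ -O3 -mavx2 -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store foo.cc

or by using an -mtune= setting that doesn't enable the split (e.g.
-mtune=haswell).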

The conventional wisdom is that unaligned loads are just as fast as aligned
loads when the data does happen to be aligned at run-time.  Splitting this way
badly breaks that assumption.  It's inconvenient (or impossible) to portably
tell the compiler to optimize for the case where the data is aligned even
though that's not guaranteed, so loadu/storeu are probably used in lots of code
that normally runs on aligned data.
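
For illustration, here's roughly the menu of options for a programmer whose
data is usually 32B-aligned but not guaranteed to be (a sketch; the function
name is made up):

#include <immintrin.h>

__m256i load_probably_aligned(const char* addr) {
        // _mm256_load_si256 is a promise: the compiler may emit vmovdqa,
        // which faults on a misaligned pointer, so it can't be used when
        // alignment is only usual, not guaranteed.
        //__m256i v = _mm256_load_si256((const __m256i*)addr);

        // __builtin_assume_aligned expresses the hint, but it's a GCC
        // extension, not portable.
        //addr = (const char*)__builtin_assume_aligned(addr, 32);

        // So portable code falls back to loadu, which carries no alignment
        // information at all for the splitting heuristic to use.
        return _mm256_loadu_si256((const __m256i*)addr);
}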

Also, gcc doesn't always figure out that a hand-written scalar prologue does
leave the pointer aligned for the vector loop.  (And since programmers expect
alignment not to matter, they may still use `_mm256_loadu_si256`.)  I reduced
some real existing code that a colleague wrote into a test-case for this:
https://godbolt.org/g/ZafCE0, also attached.  Code that uses
potentially-overlapping first/last vectors instead of scalar prologue/epilogue
loops might likewise use loadu in the main loop just to avoid duplicating a
helper function.
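
A simplified sketch of that pattern (not the attached bswap16.cc itself; the
names here are only illustrative):

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

// Scalar prologue reaches 32B alignment, but the main loop still says
// loadu/storeu because the programmer expects that to be free once the
// pointer really is aligned at run-time.
void bswap16_sketch(uint16_t* p, size_t n, __m256i shuf_mask) {
        while (n && ((uintptr_t)p & 31)) {        // align p to 32 bytes
                *p = (uint16_t)((*p << 8) | (*p >> 8));
                ++p; --n;
        }
        for (; n >= 16; p += 16, n -= 16) {       // 16 x uint16_t = 32 bytes
                __m256i v = _mm256_loadu_si256((__m256i*)p);
                // shuf_mask would be set up to swap bytes within each 16-bit element
                v = _mm256_shuffle_epi8(v, shuf_mask);
                _mm256_storeu_si256((__m256i*)p, v);
        }
        // scalar tail for the remaining n < 16 elements omitted
}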


----

For an example of affected code, consider an endian-swap function that uses
this (inline) function in its inner loop.  The code inside the loop matches
what we get when compiling it stand-alone, so I'll just show that:

#include <immintrin.h>
// static inline
void swap256(char* addr, __m256i mask) {
        // unaligned 32B load, byte shuffle, unaligned 32B store
        __m256i vec = _mm256_loadu_si256((__m256i*)addr);
        vec = _mm256_shuffle_epi8(vec, mask);
        _mm256_storeu_si256((__m256i*)addr, vec);
}


gcc6.3 -O3 -mavx2:
        vmovdqu (%rdi), %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovdqu %ymm0, (%rdi)

g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental)  -O3 -mavx2
        vmovdqu (%rdi), %xmm1
        vinserti128     $0x1, 16(%rdi), %ymm1, %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%rdi)
        vextracti128    $0x1, %ymm0, 16(%rdi)
