https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568
            Bug ID: 80568
           Summary: x86 -mavx256-split-unaligned-load (and store) is
                    affecting AVX2 code, but probably shouldn't be.
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---

Created attachment 41285
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=41285&action=edit
bswap16.cc

gcc7 (or at least the gcc8 snapshot on https://godbolt.org/g/ZafCE0) now splits
unaligned loads/stores even for AVX2 integer code, where gcc6.3 didn't.  I
think this is undesirable by default, because some projects probably build with
-mavx2 but fail to use -mtune=haswell (or broadwell or skylake).  For now,
Intel CPUs that handle 32B unaligned loads well are probably the most common
AVX2-supporting CPUs.  IDK what's optimal for Excavator or Zen.

Was this an intentional change to make those tune options work better for
those CPUs?

I would suggest that -mavx2 should imply -mno-avx256-split-unaligned-load (and
-store) for -mtune=generic.  Or, if that's too ugly (having insn-set selection
affect tuning), then maybe just revert to the previous behaviour of not
splitting integer loads/stores the way FP loads/stores are split.

The conventional wisdom is that unaligned loads are just as fast as aligned
loads when the data does happen to be aligned at run-time.  Splitting this way
badly breaks that assumption.  It's inconvenient (or impossible) to portably
tell the compiler that it should optimize for the case where the data is
aligned even when that isn't guaranteed, so loadu / storeu are probably used in
lots of code that normally runs on aligned data.

Also, gcc doesn't always figure out that a hand-written scalar prologue does
leave the pointer aligned for a vector loop.  (And since programmers expect it
not to matter, they may still use `_mm256_loadu_si256`.)  I reduced some real
existing code that a colleague wrote into a test-case for this:
https://godbolt.org/g/ZafCE0, also attached.  Code that uses
potentially-overlapping first/last vectors instead of scalar loops might also
use loadu just to avoid duplicating a helper function.

----

For an example of affected code, consider an endian-swap function that uses
this (inline) function in its inner loop.  The code inside the loop matches
what we get for compiling it stand-alone, so I'll just show that:

#include <immintrin.h>
// static inline
void swap256(char* addr, __m256i mask) {
    __m256i vec = _mm256_loadu_si256((__m256i*)addr);
    vec = _mm256_shuffle_epi8(vec, mask);
    _mm256_storeu_si256((__m256i*)addr, vec);
}

gcc6.3 -O3 -mavx2:

        vmovdqu (%rdi), %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovdqu %ymm0, (%rdi)

g++ (GCC-Explorer-Build) 8.0.0 20170429 (experimental) -O3 -mavx2:

        vmovdqu (%rdi), %xmm1
        vinserti128     $0x1, 16(%rdi), %ymm1, %ymm1
        vpshufb %ymm0, %ymm1, %ymm0
        vmovups %xmm0, (%rdi)
        vextracti128    $0x1, %ymm0, 16(%rdi)
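For reference, a minimal sketch of the kind of caller described above (this is
not the attached bswap16.cc; the bswap16_buffer wrapper and its assumptions,
e.g. that the length is a multiple of 32, are mine) should show the same split
loads/stores inside its loop when built with -mavx2 at default tuning:

#include <immintrin.h>
#include <stddef.h>

static inline void swap256(char* addr, __m256i mask) {
    __m256i vec = _mm256_loadu_si256((__m256i*)addr);
    vec = _mm256_shuffle_epi8(vec, mask);
    _mm256_storeu_si256((__m256i*)addr, vec);
}

/* Hypothetical driver: byte-swap each 16-bit element of buf in place.
   Assumes len is a multiple of 32.  vpshufb shuffles within each 128-bit
   lane, so the 16-byte index pattern is repeated for both lanes. */
void bswap16_buffer(char* buf, size_t len) {
    const __m256i mask = _mm256_set_epi8(
        14,15, 12,13, 10,11, 8,9, 6,7, 4,5, 2,3, 0,1,   /* high lane */
        14,15, 12,13, 10,11, 8,9, 6,7, 4,5, 2,3, 0,1);  /* low lane  */
    for (size_t i = 0; i < len; i += 32)
        swap256(buf + i, mask);
}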