https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82356

            Bug ID: 82356
           Summary: auto-vectorizing pack of 16->8 has a redundant AND
                    after a shift
           Product: gcc
           Version: 8.0
            Status: UNCONFIRMED
          Keywords: missed-optimization, ssemmx
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

#include <stdint.h>
#include <stddef.h>   // size_t
void pack_high8_baseline(uint8_t *__restrict__ dst, const uint16_t *__restrict__ src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     *dst++ = *src++ >> 8;
  } while(dst < end_dst);
}

https://godbolt.org/g/yoJZ3C
gcc -O3  auto-vectorizes to this inner loop:

.L5:
        movdqa  (%rdx,%rax,2), %xmm0
        movdqa  16(%rdx,%rax,2), %xmm1
        psrlw   $8, %xmm0
        pand    %xmm2, %xmm0             # Redundant with the shift
        psrlw   $8, %xmm1
        pand    %xmm2, %xmm1             # Redundant with the shift
        packuswb        %xmm1, %xmm0
        movups  %xmm0, (%rcx,%rax)
        addq    $16, %rax
        cmpq    %rsi, %rax
        jne     .L5

This is mostly good, but the PAND instructions are redundant, because psrlw by
8 already leaves the high byte of each 16-bit element zeroed.
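
For reference, here's a minimal intrinsics sketch of the loop body without the
redundant masking (pack_high8_sse2 is just an illustrative name, and it assumes
bytes is a nonzero multiple of 16, like the asm above).  The shift already
zero-extends each word, so packuswb's unsigned saturation can never kick in:

#include <emmintrin.h>
#include <stdint.h>
#include <stddef.h>

void pack_high8_sse2(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     __m128i v0 = _mm_loadu_si128((const __m128i*)src);
     __m128i v1 = _mm_loadu_si128((const __m128i*)(src+8));
     v0 = _mm_srli_epi16(v0, 8);        // high byte of each word, already zero-extended
     v1 = _mm_srli_epi16(v1, 8);
     __m128i packed = _mm_packus_epi16(v0, v1);   // no mask needed before the pack
     _mm_storeu_si128((__m128i*)dst, packed);
     dst += 16;
     src += 16;   // 16 words = 32 source bytes per iteration
  } while(dst < end_dst);
}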

The same extra AND is present when auto-vectorizing for AVX2 (but not for
AVX512, which uses a different strategy).  Other than that, the AVX2
vectorization strategy looks very good (packus ymm, then vpermq to fix up the
in-lane ordering of the result).
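
For reference, a minimal sketch of that AVX2 strategy without the extra AND
(pack_high8_avx2 is my name for it; it assumes bytes is a nonzero multiple of
32).  vpackuswb packs within 128-bit lanes, so a vpermq with immediate 0xD8
(qword order 0,2,1,3) restores the element order:

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

void pack_high8_avx2(uint8_t *restrict dst, const uint16_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     __m256i v0 = _mm256_loadu_si256((const __m256i*)src);
     __m256i v1 = _mm256_loadu_si256((const __m256i*)(src+16));
     v0 = _mm256_srli_epi16(v0, 8);                      // high byte of each word
     v1 = _mm256_srli_epi16(v1, 8);
     __m256i packed = _mm256_packus_epi16(v0, v1);       // in-lane pack, no vpand needed
     packed = _mm256_permute4x64_epi64(packed, 0xD8);    // fix the lane ordering
     _mm256_storeu_si256((__m256i*)dst, packed);
     dst += 32;
     src += 32;   // 32 words = 64 source bytes per iteration
  } while(dst < end_dst);
}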

If the input is 32B-aligned (or 64B aligned for AVX2), one of the load+shifts
can be *replaced* with an AND + unaligned load offset by -1.  This avoids a
bottleneck on shift throughput (at least with unrolling it does; without
unrolling we bottleneck on the front-end except on Ryzen).  It's even better
with AVX, because load+AND can fold into one instruction.

See https://stackoverflow.com/a/46477080/224132 for more details.

This C source produces an inner loop that I think should be very good across
K10, Bulldozer, Ryzen, Nehalem, Sandybridge, HSW/SKL, Jaguar, Atom, and
Silvermont (with SSE2 or AVX), i.e. it should be great for tune=generic after
reaching a 32B boundary.

Not great on Core2 or K8, where even a movdqu that doesn't cross a cache line
costs more.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

// take both args as uint8_t* so we can offset by 1 byte to replace a shift with an AND
// if src is 32B-aligned, we never have cache-line splits
void pack_high8_alignhack(uint8_t *restrict dst, const uint8_t *restrict src, size_t bytes) {
  uint8_t *end_dst = dst + bytes;
  do{
     __m128i v0 = _mm_loadu_si128((const __m128i*)src);  // this load should be aligned
     __m128i v1_offset = _mm_loadu_si128(1+(const __m128i*)(src-1));
     v0 = _mm_srli_epi16(v0, 8);
     __m128i v1 = _mm_and_si128(v1_offset, _mm_set1_epi16(0x00FF));
     __m128i pack = _mm_packus_epi16(v0, v1);
     _mm_storeu_si128((__m128i*)dst, pack);
     dst += 16;
     src += 32;  // 32 source bytes -> 16 output bytes
  } while(dst < end_dst);
}
