https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243

            Bug ID: 92243
           Summary: Missing "auto-vectorization" of char array reversal
                    using x86 scalar bswap when SIMD pshufb isn't
                    available
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a
factor of probably 8, the same way we use SIMD shuffles.

Consider this loop which reverses an explicit-length char array:
https://godbolt.org/z/ujXq_J

typedef char swapt; // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len)
{
  swapt *tail = head + len - 1;
  for( ; head < tail; ++head, --tail) {
      swapt h = *head, t = *tail;
      *head = t;
      *tail = h;
  }
}

gcc -O3 (including current trunk) targeting x86-64 makes naive scalar
byte-at-a-time code, even though bswap r64 is available to byte-reverse a
uint64 in 1 or 2 uops (AMD and Intel, respectively).

With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking
lengths and calculating how many 16-byte chunks can be done before bloated
fully-unrolled cleanup).  Doing the same thing with 64-bit integer registers
would be very much worth it (for code where a loop like this was a bottleneck).
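
As a rough sketch of what that transformation could look like if written by
hand: this assumes GNU C's __builtin_bswap64, and the function name
strrev_bswap64 is made up for illustration. It is not what any GCC pass
emits today. The memcpy calls are there for alignment and strict-aliasing
safety and should compile to plain 8-byte loads/stores.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical hand-written version of the loop above, for char only. */
void strrev_bswap64(char *head, long len)
{
    char *tail = head + len;   /* one past the last not-yet-reversed byte */

    /* Swap byte-reversed 8-byte chunks from both ends while the two
       chunks don't overlap. */
    while (tail - head >= 16) {
        uint64_t h, t;
        tail -= 8;
        memcpy(&h, head, 8);
        memcpy(&t, tail, 8);
        h = __builtin_bswap64(h);
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);
        memcpy(tail, &h, 8);
        head += 8;
    }

    /* Scalar cleanup for the remaining 0..15 bytes in the middle. */
    while (tail - head >= 2) {
        char h = *head, t = *--tail;
        *head++ = t;
        *tail = h;
    }
}
```

A smarter cleanup could use overlapping or narrower bswap (32-bit, or 16-bit
rotate) instead of a byte loop; the sketch keeps it simple.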

----

With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is
probably worth it, but GCC chooses not to do that either.  Alternatively, we
could work in 8-byte chunks using just movq + pshuflw, so we only need 1
shuffle per 8-byte load/store instead of 3 per 16-byte load/store.  That's a
good balance for modern Intel (Haswell, Skylake, and I think IceLake),
although some AMD and earlier Intel CPUs with more integer shuffle throughput
(e.g. Sandybridge) might do better with 3x shuffles per 16-byte load/store.
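
For illustration, here is how both shuffle strategies could be written with
SSE2 intrinsics. The helper and function names (rev8_epi16, rev4_epi16,
revshort_sse2) are invented for this sketch, and the loop uses the 8-byte
movq + pshuflw variant; it is a hand-written approximation, not compiler
output.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Reverse all 8 shorts in a vector: pshuflw + pshufhw + pshufd,
   i.e. 3 shuffles per 16 bytes. */
static inline __m128i rev8_epi16(__m128i v)
{
    v = _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    v = _mm_shufflehi_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
    return _mm_shuffle_epi32(v, _MM_SHUFFLE(1, 0, 3, 2));
}

/* Reverse the 4 shorts in the low 8 bytes: a single pshuflw. */
static inline __m128i rev4_epi16(__m128i v)
{
    return _mm_shufflelo_epi16(v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Hypothetical in-place reversal of a short array in 8-byte chunks. */
void revshort_sse2(short *head, long len)
{
    short *tail = head + len;  /* one past the last not-yet-reversed short */

    while (tail - head >= 8) {
        tail -= 4;
        __m128i h = _mm_loadl_epi64((const __m128i *)head);  /* movq load */
        __m128i t = _mm_loadl_epi64((const __m128i *)tail);
        _mm_storel_epi64((__m128i *)head, rev4_epi16(t));    /* movq store */
        _mm_storel_epi64((__m128i *)tail, rev4_epi16(h));
        head += 4;
    }

    /* Scalar cleanup for the remaining 0..7 shorts in the middle. */
    while (tail - head >= 2) {
        short h = *head, t = *--tail;
        *head++ = t;
        *tail = h;
    }
}
```

Swapping rev4_epi16 and the movq loads/stores for rev8_epi16 with
_mm_loadu_si128 / _mm_storeu_si128 (and a chunk size of 8 shorts) gives the
3-shuffles-per-16-bytes variant.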
