https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243
Bug ID: 92243
Summary: Missing "auto-vectorization" of char array reversal
using x86 scalar bswap when SIMD pshufb isn't
available
Product: gcc
Version: 10.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: peter at cordes dot ca
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*
We could use integer bswap to speed up an in-place byte-reverse loop by a
factor of probably 8, the same way we use SIMD shuffles.
Consider this loop which reverses an explicit-length char array:
https://godbolt.org/z/ujXq_J
typedef char swapt; // int can auto-vectorize with just SSE2
void strrev_explicit(swapt *head, long len)
{
    swapt *tail = head + len - 1;
    for( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}
gcc -O3 (including current trunk) targeting x86-64 makes naive scalar
byte-at-a-time code, even though bswap r64 is available to byte-reverse a
uint64 in 1 or 2 uops (AMD and Intel, respectively).
With -mssse3, we do see auto-vectorization using SIMD pshufb (after checking
lengths and calculating how many 16-byte chunks can be done before bloated
fully-unrolled cleanup). Doing the same thing with 64-bit integer registers
would be very much worth it (for code where a loop like this was a bottleneck).
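For illustration, here is a hand-written sketch of roughly what that transformation could look like at the source level. The function name strrev_bswap64 and the simple scalar cleanup are my own assumptions, not anything GCC emits; an actual vectorizer would make its own choices about the loop structure and cleanup.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical hand-written equivalent of the requested optimization:
 * reverse a byte array in place, 8 bytes at a time from each end. */
void strrev_bswap64(char *head, long len)
{
    char *tail = head + len;    /* one past the last unreversed byte */

    /* While two non-overlapping 8-byte chunks remain, byte-reverse each
     * chunk with bswap and store it at the opposite end. */
    while (tail - head >= 16) {
        uint64_t h, t;
        tail -= 8;
        memcpy(&h, head, 8);    /* unaligned-safe 8-byte loads */
        memcpy(&t, tail, 8);
        h = __builtin_bswap64(h);
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);    /* reversed tail chunk goes to the front */
        memcpy(tail, &h, 8);    /* reversed head chunk goes to the back */
        head += 8;
    }

    /* Scalar cleanup for the 0..15 bytes left in the middle. */
    while (tail - head >= 2) {
        char h = *head;
        --tail;
        *head = *tail;
        *tail = h;
        ++head;
    }
}
```

The fixed-size memcpy calls are only there to express unaligned loads/stores without aliasing problems; with optimization enabled I'd expect them to become plain 8-byte mov instructions.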
----
With `swapt = short`, vectorizing with SSE2 pshuflw / pshufhw / pshufd is
probably worth it, but GCC chooses not to do that either. An alternative is to
work in 8-byte chunks using just movq + pshuflw, so we only need 1 shuffle per
8-byte load/store instead of 3 per 16-byte load/store. That's a good balance
for modern Intel (Haswell, Skylake, and I think IceLake), although some AMD and
earlier Intel with more integer shuffle throughput (e.g. Sandybridge) might do
better with 3x shuffles per 16-byte load/store.
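As a sketch of the 8-byte-chunk idea for `short`, written by hand with SSE2 intrinsics: the function name rev16_sse2_movq and the scalar cleanup loop are assumptions for illustration only, and this is not a claim about what GCC's vectorizer would generate.

```c
#include <emmintrin.h>   /* SSE2 intrinsics */

/* Hypothetical hand-written version: reverse an array of 16-bit elements
 * in place using movq loads/stores and one pshuflw per 8-byte chunk. */
void rev16_sse2_movq(short *head, long len)
{
    short *tail = head + len;   /* one past the last unreversed element */

    /* While two non-overlapping 4-element (8-byte) chunks remain. */
    while (tail - head >= 8) {
        tail -= 4;
        __m128i h = _mm_loadl_epi64((const __m128i *)head);   /* movq load */
        __m128i t = _mm_loadl_epi64((const __m128i *)tail);
        /* pshuflw with selector 0x1B reverses the four low 16-bit words. */
        h = _mm_shufflelo_epi16(h, _MM_SHUFFLE(0, 1, 2, 3));
        t = _mm_shufflelo_epi16(t, _MM_SHUFFLE(0, 1, 2, 3));
        _mm_storel_epi64((__m128i *)head, t);                 /* movq store */
        _mm_storel_epi64((__m128i *)tail, h);
        head += 4;
    }

    /* Scalar cleanup for the 0..7 elements left in the middle. */
    while (tail - head >= 2) {
        short h = *head;
        --tail;
        *head = *tail;
        *tail = h;
        ++head;
    }
}
```

The 16-byte variant would use a full-width load and add pshufhw plus a pshufd to swap the two 64-bit halves, which is where the 3 shuffles per 16 bytes come from.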