https://gcc.gnu.org/bugzilla/show_bug.cgi?id=92243
            Bug ID: 92243
           Summary: Missing "auto-vectorization" of char array reversal
                    using x86 scalar bswap when SIMD pshufb isn't available
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: peter at cordes dot ca
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

We could use integer bswap to speed up an in-place byte-reverse loop by a
factor of probably 8, the same way we use SIMD shuffles.

Consider this loop, which reverses an explicit-length char array:
https://godbolt.org/z/ujXq_J

typedef char swapt;  // int can auto-vectorize with just SSE2

void strrev_explicit(swapt *head, long len)
{
    swapt *tail = head + len - 1;
    for ( ; head < tail; ++head, --tail) {
        swapt h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

gcc -O3 (including current trunk) targeting x86-64 emits naive scalar
byte-at-a-time code, even though bswap r64 is available to byte-reverse a
uint64 in 1 or 2 uops (AMD and Intel, respectively).

With -mssse3 we do see auto-vectorization using SIMD pshufb (after checking
lengths and calculating how many 16-byte chunks can be done before the
bloated fully-unrolled cleanup).  Doing the same thing with 64-bit integer
registers would be very much worth it for code where a loop like this is a
bottleneck; a hand-written sketch of that strategy is below.

----

With swapt = short, vectorizing with SSE2 pshuflw / pshufhw / pshufd is
probably worth it, but GCC chooses not to do that either.  Or we could work
in 8-byte chunks using just movq + pshuflw, so there is only 1 shuffle per
8-byte load/store instead of 3 per 16-byte load/store (see the second sketch
below).  That's a good balance for modern Intel (Haswell, Skylake, and I
think Ice Lake), although some AMD CPUs and earlier Intel CPUs with more
integer-shuffle throughput (e.g. Sandybridge) might do better with 3 shuffles
per 16-byte load/store.
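
Here's a hand-written sketch of the bswap strategy for the char case (my
code, not GCC output; it uses GNU C's __builtin_bswap64, with memcpy for the
unaligned loads/stores that would compile to plain mov):

#include <stdint.h>
#include <string.h>

void strrev_explicit_bswap(char *head, long len)
{
    char *tail = head + len - 1;
    // Main loop: while at least 16 bytes remain between the two ends,
    // load an 8-byte chunk from each end, byte-reverse each with bswap,
    // then store them crossed over.
    while (tail - head >= 15) {
        uint64_t h, t;
        memcpy(&h, head, 8);                // unaligned 8-byte loads
        memcpy(&t, tail - 7, 8);
        h = __builtin_bswap64(h);           // one bswap r64 each
        t = __builtin_bswap64(t);
        memcpy(head, &t, 8);                // stores crossed over
        memcpy(tail - 7, &h, 8);
        head += 8;
        tail -= 8;
    }
    for ( ; head < tail; ++head, --tail) {  // scalar cleanup, < 16 bytes
        char h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

That's 2 loads + 2 bswaps + 2 stores per 16 bytes reversed, instead of 16
separate byte loads and byte stores, which is where the ~8x factor comes
from.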
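
And for the short case, a sketch of the movq + pshuflw version with SSE2
intrinsics (again hand-written to show the shape of the desired asm, not
something GCC emits today):

#include <emmintrin.h>   // SSE2

void strrev_short_sse2(short *head, long len)
{
    short *tail = head + len - 1;
    // Need at least 8 elements so the two 4-element chunks don't overlap.
    while (tail - head >= 7) {
        __m128i h = _mm_loadl_epi64((const __m128i *)head);        // movq
        __m128i t = _mm_loadl_epi64((const __m128i *)(tail - 3));  // movq
        h = _mm_shufflelo_epi16(h, _MM_SHUFFLE(0, 1, 2, 3));  // pshuflw:
        t = _mm_shufflelo_epi16(t, _MM_SHUFFLE(0, 1, 2, 3));  // reverse 4 words
        _mm_storel_epi64((__m128i *)head, t);             // movq stores,
        _mm_storel_epi64((__m128i *)(tail - 3), h);       // crossed over
        head += 4;
        tail -= 4;
    }
    for ( ; head < tail; ++head, --tail) {   // scalar cleanup, < 8 elements
        short h = *head, t = *tail;
        *head = t;
        *tail = h;
    }
}

The 16-byte-per-chunk variant would replace each movq + pshuflw pair with
movdqu + pshuflw + pshufhw + pshufd (imm8 = 0x1B, 0x1B, 0x4E), i.e. the
3 shuffles per 16-byte load/store mentioned above.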