https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093
Bug ID: 117093
Summary: Missing detection of REV64 vector permute
Product: gcc
Version: 15.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: tree-optimization
Assignee: unassigned at gcc dot gnu.org
Reporter: ktkachov at gcc dot gnu.org
CC: tnfchris at gcc dot gnu.org
Target Milestone: ---
Target: aarch64

This testcase is reduced from a hashing code:

#include <arm_neon.h>

uint64x2_t ror32_neon_tgt_gcc_bad(uint64x2_t r) {
  uint32x4_t a = vreinterpretq_u32_u64 (r);
  uint32_t t;
  t = a[0];
  a[0] = a[1];
  a[1] = t;
  t = a[2];
  a[2] = a[3];
  a[3] = t;
  return vreinterpretq_u64_u32 (a);
}

LLVM is able to produce on aarch64:

ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        rev64   v0.4s, v0.4s
        ret

Whereas GCC does:

ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        mov     v31.16b, v0.16b
        ins     v31.s[0], v0.s[1]
        ins     v31.s[1], v0.s[0]
        ins     v31.s[2], v0.s[3]
        ins     v31.s[3], v0.s[2]
        mov     v0.16b, v31.16b
        ret

I'm not sure what part in GCC would handle this. Is that something SLP vectorisation would pick up and optimise using its permute logic? Or something bswap could do?
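
For comparison, when the same lane swap is written as an explicit permute, GCC does emit rev64 today. A minimal sketch, assuming the standard vrev64q_u32 intrinsic and GCC's generic __builtin_shuffle with a constant {1,0,3,2} mask (these variants are added here for illustration and are not part of the reduced testcase):

#include <arm_neon.h>

/* Same swap via the dedicated intrinsic: reverse the 32-bit lanes
   within each 64-bit half, i.e. a[0]<->a[1] and a[2]<->a[3].  */
uint64x2_t ror32_neon_rev64(uint64x2_t r) {
  return vreinterpretq_u64_u32 (vrev64q_u32 (vreinterpretq_u32_u64 (r)));
}

/* Same swap via a constant-mask shuffle; the aarch64 backend's
   vec_perm_const handling should recognise this mask as REV64.  */
uint64x2_t ror32_neon_shuffle(uint64x2_t r) {
  uint32x4_t a = vreinterpretq_u32_u64 (r);
  uint32x4_t s = __builtin_shuffle (a, (uint32x4_t) {1, 0, 3, 2});
  return vreinterpretq_u64_u32 (s);
}

So the missed optimisation seems to be about turning the scalar lane-by-lane swap into such a single VEC_PERM_EXPR in the first place, rather than about the backend's ability to match REV64.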