https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117093

            Bug ID: 117093
           Summary: Missing detection of REV64 vector permute
           Product: gcc
           Version: 15.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ktkachov at gcc dot gnu.org
                CC: tnfchris at gcc dot gnu.org
  Target Milestone: ---
            Target: aarch64

This testcase is reduced from hashing code:
#include <arm_neon.h>

uint64x2_t ror32_neon_tgt_gcc_bad(uint64x2_t r) {
    uint32x4_t a = vreinterpretq_u32_u64 (r);
    uint32_t t;
    /* Swap the two 32-bit halves of each 64-bit element,
       i.e. rotate each 64-bit lane by 32 bits.  */
    t = a[0]; a[0] = a[1]; a[1] = t;
    t = a[2]; a[2] = a[3]; a[3] = t;
    return vreinterpretq_u64_u32 (a);
}
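
For reference, the two lane swaps reverse the 32-bit elements within each
64-bit doubleword, which is exactly the semantics of the vrev64q_u32
intrinsic. A minimal sketch of the form this should ideally reduce to
(function name is just for illustration):

#include <arm_neon.h>

/* Equivalent formulation using the REV64 intrinsic directly; this
   should compile to a single rev64 v0.4s instruction.  */
uint64x2_t ror32_neon_rev64(uint64x2_t r) {
    uint32x4_t a = vreinterpretq_u32_u64 (r);
    return vreinterpretq_u64_u32 (vrev64q_u32 (a));
}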

On aarch64, LLVM is able to produce:
ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        rev64   v0.4s, v0.4s
        ret

Whereas GCC does:
ror32_neon_tgt_gcc_bad(__Uint64x2_t):
        mov     v31.16b, v0.16b
        ins     v31.s[0], v0.s[1]
        ins     v31.s[1], v0.s[0]
        ins     v31.s[2], v0.s[3]
        ins     v31.s[3], v0.s[2]
        mov     v0.16b, v31.16b
        ret

I'm not sure which part of GCC should handle this. Is it something SLP
vectorisation would pick up and optimise using its permute logic, or
something the bswap pass could do?
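
For what it's worth, rewriting the swap as a constant permute (a sketch
using GCC's generic __builtin_shuffle extension; the function name is
again just illustrative) hands the backend a VEC_PERM_EXPR, which I'd
expect it can already match to REV64:

#include <arm_neon.h>

uint64x2_t ror32_neon_shuffle(uint64x2_t r) {
    uint32x4_t a = vreinterpretq_u32_u64 (r);
    /* Constant permute {1,0,3,2}: swap adjacent 32-bit lanes.  */
    uint32x4_t mask = {1, 0, 3, 2};
    return vreinterpretq_u64_u32 (__builtin_shuffle (a, mask));
}

If that form does generate rev64, then presumably the missing piece is
recognising the scalar lane swaps as that permute in the first place.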
