https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102758
Bug ID: 102758
Summary: [x86] Failure to use registers optimally when swapping between (identically represented) vector types
Product: gcc
Version: 12.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: gabravier at gmail dot com
Target Milestone: ---

#include <stdint.h>

typedef int64_t v2i64 __attribute__((vector_size(16)));
typedef uint16_t v8u16 __attribute__((vector_size(16)));

v2i64 f(v8u16 make_b_xxm1, v2i64 b)
{
    return (v2i64)((v8u16)b + (v8u16){1});
}

With -O3, GCC outputs this:

f(unsigned short __vector(8), long __vector(2)):
        movdqa  xmm2, XMMWORD PTR .LC0[rip]
        paddw   xmm2, xmm1
        movdqa  xmm0, xmm2
        ret

LLVM outputs this:

f(unsigned short __vector(8), long __vector(2)):
        movdqa  xmm0, xmm1
        paddw   xmm0, xmmword ptr [rip + .LCPI0_0]
        ret

It should be possible to optimize out the last `movdqa`. This seems to be directly related to the use of differing types here (even though the conversion is cost-free), as replacing every use of `v2i64` with `v8u16` results in better-optimized code.
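
For reference, the same-type variant mentioned above might look like the sketch below. This is only one reading of "replacing all usage of `v2i64` with `v8u16`"; the names `f_mixed`, `f_same` and `variants_agree` are hypothetical and do not appear in the original test case. The helper only checks that both variants compute the same bits, so that any difference in generated code would be purely a register-allocation matter. The assembly GCC emits for the variant is not shown in this report and is not reproduced here.

```c
#include <stdint.h>
#include <string.h>

typedef int64_t v2i64 __attribute__((vector_size(16)));
typedef uint16_t v8u16 __attribute__((vector_size(16)));

/* Shape from the report: second argument and result are v2i64,
   the addition itself is done on v8u16. */
v2i64 f_mixed(v8u16 make_b_xxm1, v2i64 b)
{
    return (v2i64)((v8u16)b + (v8u16){1});
}

/* Assumed same-type variant: every v2i64 replaced by v8u16,
   which makes the casts unnecessary. */
v8u16 f_same(v8u16 make_b_xxm1, v8u16 b)
{
    return b + (v8u16){1};
}

/* Hypothetical helper: returns 1 when both variants produce
   bit-identical results for the same 128-bit input. */
int variants_agree(v2i64 b)
{
    v8u16 unused = {0};
    v2i64 r_mixed = f_mixed(unused, b);
    v8u16 r_same = f_same(unused, (v8u16)b);
    return memcmp(&r_mixed, &r_same, sizeof r_mixed) == 0;
}
```

Compiling `f_mixed` and `f_same` side by side with -O3 should make it easy to compare whether the trailing `movdqa` appears in one and not the other.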