https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121274
Bug ID: 121274
Summary: xmm extraction from zmm vector emits unnecessary vpextrq/vpinsrq sequence
Product: gcc
Version: 16.0
Status: UNCONFIRMED
Keywords: missed-optimization
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: mkretz at gcc dot gnu.org
Target Milestone: ---
Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/PP36hnv55):

  using V [[gnu::vector_size(64)]] = int;

  auto f(V x)
  {
    return __builtin_shufflevector(x, x, 0, 1, 2, 3);
  }

With -march=skylake-avx512 this compiles to

  vpextrq rdx, xmm0, 1
  vpinsrq xmm0, xmm0, rdx, 1
  ret

which should instead be a simple 'ret'.

The same issue happens on extraction of the other 128-bit parts
(https://compiler-explorer.com/z/YYv8a1WTb):

  __builtin_shufflevector(x, x, 4, 5, 6, 7);

is compiled to:

  vextracti32x4 xmm2, zmm0, 1
  vpextrq rdx, xmm2, 1
  vpinsrq xmm0, xmm2, rdx, 1
  ret

The expected result is:

  vextracti32x4 xmm2, zmm0, 1
  ret

From my limited understanding of GCC tree dumps, I believe the "optimized
(tree)" pass is fine and the issue happens in the x86-target specific code?