https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121274

            Bug ID: 121274
           Summary: xmm extraction from zmm vector emits unnecessary
                    vpextrq/vpinsrq sequence
           Product: gcc
           Version: 16.0
            Status: UNCONFIRMED
          Keywords: missed-optimization
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: mkretz at gcc dot gnu.org
  Target Milestone: ---
            Target: x86_64-*-*, i?86-*-*

Test case (https://compiler-explorer.com/z/PP36hnv55):

using V [[gnu::vector_size(64)]] = int;

auto f(V x)
{
  return __builtin_shufflevector(x, x, 0, 1, 2, 3);
}

With -march=skylake-avx512 this compiles to

        vpextrq rdx, xmm0, 1
        vpinsrq xmm0, xmm0, rdx, 1
        ret

which should instead be a simple 'ret'.

The same issue happens on extraction of the other 128-bit parts
(https://compiler-explorer.com/z/YYv8a1WTb):

  __builtin_shufflevector(x, x, 4, 5, 6, 7);

is compiled to:

        vextracti32x4   xmm2, zmm0, 1
        vpextrq rdx, xmm2, 1
        vpinsrq xmm0, xmm2, rdx, 1
        ret

The expected result is:

        vextracti32x4   xmm0, zmm0, 1
        ret


From my limited understanding of GCC tree dumps, I believe the "optimized
(tree)" dump is fine, so the issue presumably arises in the x86
target-specific code.
