https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80355

            Bug ID: 80355
           Summary: Improve __builtin_shuffle on AVX512F
           Product: gcc
           Version: 7.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jakub at gcc dot gnu.org
  Target Milestone: ---

As mentioned in https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00375.html
we emit inefficient code for:
typedef long long V __attribute__((vector_size (64)));
typedef int W __attribute__((vector_size (64)));
W f0 (W x) {
  return __builtin_shuffle (x, (W) { 8, 9, 10, 11, 12, 13, 14, 15,
                                     0, 1, 2, 3, 4, 5, 6, 7 });
}
V f1 (V x) {
  return __builtin_shuffle (x, (V) { 4, 5, 6, 7, 0, 1, 2, 3 });
}
e.g. the index vector is loaded from the constant pool and fed to a variable
permute:
        vmovdqa64       .LC0(%rip), %zmm1
        vpermd  %zmm0, %zmm1, %zmm0
or
        vmovdqa64       .LC1(%rip), %zmm1
        vpermq  %zmm0, %zmm1, %zmm0
while we could use the vshufi64x2 instruction instead, which takes just an
immediate and so needs no constant load.
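
To illustrate why a single immediate is enough here, below is a minimal scalar
sketch of the lane selection, assuming the documented vshufi64x2 semantics for
512-bit operands (each 2-bit field of the immediate picks one 128-bit lane; the
low two result lanes come from the first source, the high two from the second).
The helper names shuf_i64x2_model, imm_matches_f1 and imm_matches_f0 are
hypothetical and only for illustration, and the immediate 0x4e is derived from
that model rather than taken from the report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Scalar model of vshufi64x2 on 512-bit operands (hypothetical helper,
   not a GCC or Intel API).  Each 2-bit field of IMM picks one 128-bit
   lane (two 64-bit elements): result lanes 0 and 1 are taken from A,
   result lanes 2 and 3 are taken from B.  */
static void
shuf_i64x2_model (int64_t dst[8], const int64_t a[8], const int64_t b[8],
                  unsigned imm)
{
  int64_t tmp[8];
  assert (imm <= 0xff);
  for (int lane = 0; lane < 4; lane++)
    {
      const int64_t *src = lane < 2 ? a : b;
      unsigned sel = (imm >> (2 * lane)) & 3;
      tmp[2 * lane] = src[2 * sel];
      tmp[2 * lane + 1] = src[2 * sel + 1];
    }
  memcpy (dst, tmp, sizeof tmp);
}

/* Returns 1 if immediate 0x4e (lane order 2, 3, 0, 1) reproduces
   f1's mask { 4, 5, 6, 7, 0, 1, 2, 3 } on 64-bit elements.  */
static int
imm_matches_f1 (void)
{
  static const int perm[8] = { 4, 5, 6, 7, 0, 1, 2, 3 };
  int64_t x[8], out[8];
  for (int i = 0; i < 8; i++)
    x[i] = 100 + i;
  shuf_i64x2_model (out, x, x, 0x4e);
  for (int i = 0; i < 8; i++)
    if (out[i] != x[perm[i]])
      return 0;
  return 1;
}

/* Returns 1 if the same immediate reproduces f0's 32-bit mask
   { 8, ..., 15, 0, ..., 7 }.  The 32-bit elements move in adjacent
   pairs, so the 64-bit lane shuffle covers this case as well.  */
static int
imm_matches_f0 (void)
{
  int32_t x[16], out[16];
  int64_t xq[8], outq[8];
  for (int i = 0; i < 16; i++)
    x[i] = 200 + i;
  memcpy (xq, x, sizeof x);
  shuf_i64x2_model (outq, xq, xq, 0x4e);
  memcpy (out, outq, sizeof out);
  for (int i = 0; i < 16; i++)
    if (out[i] != x[(i + 8) % 16])
      return 0;
  return 1;
}
```

Both masks swap the two 256-bit halves of the vector, which is why one
immediate serves f0 and f1 alike.  On a target with AVX512F, the intrinsic
_mm512_shuffle_i64x2 (x, x, 0x4e) from <immintrin.h> should express the same
operation, though that has not been checked against generated code here.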
