https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80355
Bug ID: 80355
Summary: Improve __builtin_shuffle on AVX512F
Product: gcc
Version: 7.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jakub at gcc dot gnu.org
Target Milestone: ---

As mentioned in
https://gcc.gnu.org/ml/gcc-patches/2017-04/msg00375.html
we emit inefficient code for:

typedef long long V __attribute__((vector_size (64)));
typedef int W __attribute__((vector_size (64)));

W
f0 (W x)
{
  return __builtin_shuffle (x, (W) { 8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7 });
}

V
f1 (V x)
{
  return __builtin_shuffle (x, (V) { 4, 5, 6, 7, 0, 1, 2, 3 });
}

e.g.
        vmovdqa64       .LC0(%rip), %zmm1
        vpermd  %zmm0, %zmm1, %zmm0
or
        vmovdqa64       .LC1(%rip), %zmm1
        vpermq  %zmm0, %zmm1, %zmm0
while we could use the vshufi64x2 instruction instead, which takes just an
immediate and so needs no permutation constant loaded from memory.
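
To illustrate why an immediate suffices here, below is a minimal scalar sketch of the 512-bit lane selection as I understand it from the instruction documentation: each 2-bit field of the immediate picks one 128-bit lane, the low two result lanes come from the first source and the high two from the second. The names shuf_i64x2_model and imm_swaps_halves are hypothetical helpers written for this example; they are not GCC internals or intrinsics. Under that assumption, passing x as both sources with immediate 0x4E should reproduce the f1 permutation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scalar model of the 512-bit lane shuffle: the vector is
   treated as four 128-bit lanes of two 64-bit elements each.  Result
   lanes 0 and 1 are selected from a, result lanes 2 and 3 from b, each
   by one 2-bit field of imm.  dst must not alias a or b.  */
static void
shuf_i64x2_model (uint64_t dst[8], const uint64_t a[8],
                  const uint64_t b[8], unsigned imm)
{
  for (int lane = 0; lane < 4; lane++)
    {
      const uint64_t *src = lane < 2 ? a : b;
      unsigned sel = (imm >> (2 * lane)) & 3;
      dst[2 * lane] = src[2 * sel];
      dst[2 * lane + 1] = src[2 * sel + 1];
    }
}

/* Return 1 if imm, with x used as both sources, gives the f1
   permutation { 4, 5, 6, 7, 0, 1, 2, 3 }, else 0.  */
int
imm_swaps_halves (unsigned imm)
{
  uint64_t x[8], r[8];

  assert (imm < 256);   /* The immediate is 8 bits wide.  */
  for (int i = 0; i < 8; i++)
    x[i] = 100 + i;
  shuf_i64x2_model (r, x, x, imm);
  for (int i = 0; i < 8; i++)
    if (r[i] != x[(i + 4) % 8])
      return 0;
  return 1;
}
```

The f0 case moves the same bytes (it swaps the two 256-bit halves, just viewed as 16 ints), so the same immediate would apply there as well. With AVX512F intrinsics this would presumably be written as _mm512_shuffle_i64x2 (x, x, 0x4E), though that is not checked here.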