https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102055
--- Comment #2 from Andrew Pinski <pinskia at gcc dot gnu.org> ---
The use of ldr/tbl vs rev64/ext is questionable and depends on whether we are inside a loop or not. If it is inside a loop and there are enough registers, then using TBL is better on many (though not all) micro-architectures, as it has similar latency to rev64.

Though I should note that clang/LLVM implements it as rev64/ext.

E.g.:
```
#define vector __attribute__((vector_size(16)))

vector char g(vector char a)
{
  return __builtin_shufflevector (a,a,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0);
}

vector char g1(vector char a)
{
  vector char t = __builtin_shufflevector (a,a,7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8);
  vector long long t1 = (vector long long)t;
  t1 = __builtin_shufflevector (t1,t1,1,0);
  return (vector char)t1;
}
```
Produces:
```
        rev64   v0.16b, v0.16b
        ext     v0.16b, v0.16b, v0.16b, #8
```
For both.
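
To see why the rev64/ext pair is equivalent to the full 16-byte reversal in g, here is a small scalar sketch in portable C. The helpers model_rev64_16b and model_ext_16b are hypothetical names introduced only for this illustration; they model the byte-level effect of the two instructions as I understand them (REV64 reverses bytes within each 64-bit half, EXT #8 with the same source twice swaps the two halves) and are not compiler code or intrinsics.

```c
#include <stdint.h>

/* Scalar model (assumption) of "rev64 vd.16b, vn.16b":
   reverse the byte order inside each 64-bit half. */
static void model_rev64_16b(const uint8_t in[16], uint8_t out[16])
{
    for (int i = 0; i < 16; i++)
        out[i] = in[(i & 8) | (7 - (i & 7))];
}

/* Scalar model (assumption) of "ext vd.16b, vn.16b, vm.16b, #imm":
   take 16 bytes starting at byte imm of the pair {n (low), m (high)}. */
static void model_ext_16b(const uint8_t n[16], const uint8_t m[16],
                          int imm, uint8_t out[16])
{
    for (int i = 0; i < 16; i++) {
        int j = i + imm;
        out[i] = (j < 16) ? n[j] : m[j - 16];
    }
}

/* The two-instruction sequence from the comment: rev64, then ext #8. */
static void reverse_via_rev64_ext(const uint8_t in[16], uint8_t out[16])
{
    uint8_t t[16];
    model_rev64_16b(in, t);
    model_ext_16b(t, t, 8, out);
}

/* Returns 1 if the sequence matches a plain byte reversal
   (the 15,14,...,0 shuffle in g), 0 otherwise. */
static int rev64_ext_matches_reverse(void)
{
    uint8_t in[16], got[16];
    for (int i = 0; i < 16; i++)
        in[i] = (uint8_t)(0xA0 + i);   /* distinct byte values */
    reverse_via_rev64_ext(in, got);
    for (int i = 0; i < 16; i++)
        if (got[i] != in[15 - i])
            return 0;
    return 1;
}
```

This only checks that the two forms compute the same permutation; it says nothing about which sequence is faster on a given core.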