https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007
--- Comment #15 from Steven Munroe <munroesj at gcc dot gnu.org> ---
Found that vec_splat_u32 constant shift counts are handled differently across
the various shift/rotate intrinsics.

Even for the 5-bit shift counts (the easy case) the behavior of the various
shift/rotate intrinsics is inconsistent. The compiler pays way too much
attention to how the shift count is generated, and does so differently for
shift left word versus shift right word, and differently again for rotate left
word.

Any reasonable person would assume that using vec_splat_u32() for any shift
value 1 to 31 (-16 to 15) will generate efficient code. And it does for
vec_vslw(), which generates two instructions (vspltisw v0,-16; vslw v2,v2,v0).

But the compiler behaves differently for vec_vsrw() and vec_vsraw():
- for values 1-15 it generates:
  - vspltisw v0,15; vsrw v2,v2,v0
- for even values between 16 and 30 it generates:
  - vspltisw v0,8; vadduwm v0,v0,v0; vsrw v2,v2,v0
- for odd values between 17 and 31 it generates a load from .rodata

And it is positively strange for vec_vrlw():
- for values 1-15 it generates:
  - vspltisw v0,15; vrlw v2,v2,v0
- but for any value between 16 and 31 it gets strange:

0000000000001200 <test_rlwi_16>:
    1200:       30 00 20 39     li      r9,48
    1204:       8c 03 00 10     vspltisw v0,0
    1208:       67 01 29 7c     mtvrd   v1,r9
    120c:       93 0a 21 f0     xxspltw vs33,vs33,1
    1210:       80 0c 00 10     vsubuwm v0,v0,v1
    1214:       84 00 42 10     vrlw    v2,v2,v0
    1218:       20 00 80 4e     blr
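
For reference, a minimal portable sketch of why a single splat immediate should be enough for every word shift count. It assumes the ISA behavior that vslw, vsrw and vrlw use only the low-order 5 bits of each shift-count word, and that vspltisw sign-extends its 5-bit immediate. The helper names (splat_imm_for_count, model_vslw, model_vsrw, model_vrlw) are hypothetical scalar models of one word lane, not GCC or AltiVec APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the signed 5-bit immediate (-16..15) whose low
   5 bits equal the requested shift count (0..31). */
static int splat_imm_for_count(unsigned count)
{
    assert(count < 32);
    return count < 16 ? (int)count : (int)count - 32;
}

/* Scalar model of one word lane of vslw: only the low 5 bits of the
   shift-count element are used. */
static uint32_t model_vslw(uint32_t a, int32_t shift_elem)
{
    return a << ((uint32_t)shift_elem & 31u);
}

/* Scalar model of one word lane of vsrw (logical shift right). */
static uint32_t model_vsrw(uint32_t a, int32_t shift_elem)
{
    return a >> ((uint32_t)shift_elem & 31u);
}

/* Scalar model of one word lane of vrlw (rotate left). */
static uint32_t model_vrlw(uint32_t a, int32_t shift_elem)
{
    uint32_t n = (uint32_t)shift_elem & 31u;
    /* Guard n == 0 to avoid an undefined 32-bit shift in C. */
    return n ? (a << n) | (a >> (32u - n)) : a;
}
```

Under this model, a splat of -16 acts as a count of 16 for all three operations, which is what the vec_vslw() case already relies on. The value 0 - 48 computed in test_rlwi_16 also reduces to 16 in the low 5 bits, so the longer sequence ends up with the same effective rotate count.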