https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124
Bug ID: 104124
Summary: Poor optimization for vector splat DW with small consts
Product: gcc
Version: 11.1.1
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: munroesj at gcc dot gnu.org
Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading all the vector long long constants I need in my code. This is leaf code of a size that can run out of the volatile registers (no stack frame). But this puts more pressure on volatile VRs, VSRs, and GPRs. Especially GPRs, because it is loading from .rodata when it could (and should) use a vector immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

generate:

00000000000001a0 <__test_splatudi_0_V0>:
 1a0:   8c 03 40 10     vspltisw v2,0
 1a4:   20 00 80 4e     blr

00000000000001c0 <__test_splatudi_1_V0>:
 1c0:   8c 03 5f 10     vspltisw v2,-1
 1c4:   20 00 80 4e     blr
        ...

But other cases that could use immediates, like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 generates for power8:

0000000000000170 <__test_splatudi_12_V0>:
 170:   00 00 4c 3c     addis   r2,r12,0
                        170: R_PPC64_REL16_HA   .TOC.
 174:   00 00 42 38     addi    r2,r2,0
                        174: R_PPC64_REL16_LO   .TOC.+0x4
 178:   00 00 22 3d     addis   r9,r2,0
                        178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 17c:   00 00 29 39     addi    r9,r9,0
                        17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 180:   ce 48 40 7c     lvx     v2,0,r9
 184:   20 00 80 4e     blr

and for power9:

0000000000000000 <__test_splatisd_12_PWR9>:
   0:   d1 62 40 f0     xxspltib vs34,12
   4:   02 16 58 10     vextsb2d v2,v2
   8:   20 00 80 4e     blr

So why can't the power8 target generate:

00000000000000f0 <__test_splatudi_12_V1>:
  f0:   8c 03 4c 10     vspltisw v2,12
  f4:   4e 16 40 10     vupkhsw v2,v2
  f8:   20 00 80 4e     blr

This is 4 cycles vs. 9 (best case, and it is always 9 cycles because GCC does not exploit immediate fusion). In fact, GCC 8 (AT12) does this.
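To illustrate why the two-instruction sequence is enough for small constants, here is a minimal portable C model of the instruction semantics involved. It is a sketch, not PowerPC code: the struct types and the model_* function names are made up for illustration, and it assumes the 5-bit signed immediate of vspltisw (-16..15) and an 8-bit immediate for xxspltib that vextsb2d then sign-extends.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model types: one 128-bit register viewed as 4 words
   or as 2 doublewords. These are not the vi64_t/vi32_t types used in
   the report. */
typedef struct { int32_t w[4]; } model_vw_t;
typedef struct { int64_t d[2]; } model_vd_t;

/* Models vspltisw: copy a 5-bit signed immediate into every word. */
model_vw_t model_vspltisw(int sim)
{
    model_vw_t r;
    assert(sim >= -16 && sim <= 15);
    for (int i = 0; i < 4; i++)
        r.w[i] = (int32_t) sim;
    return r;
}

/* Models vupkhsw: sign-extend two of the words to doublewords.
   Which two words count as "high" depends on element numbering; for
   a splat it does not matter because all four words are equal. */
model_vd_t model_vupkhsw(model_vw_t v)
{
    model_vd_t r;
    r.d[0] = (int64_t) v.w[0];
    r.d[1] = (int64_t) v.w[1];
    return r;
}

/* Models the power8 candidate sequence: vspltisw then vupkhsw. */
model_vd_t model_p8_splat_dw(int sim)
{
    return model_vupkhsw(model_vspltisw(sim));
}

/* Models the power9 sequence: xxspltib then vextsb2d, i.e. a byte
   immediate sign-extended into each doubleword. */
model_vd_t model_p9_splat_dw(int imm8)
{
    model_vd_t r;
    assert(imm8 >= -128 && imm8 <= 127);
    r.d[0] = (int64_t) (int8_t) imm8;
    r.d[1] = (int64_t) (int8_t) imm8;
    return r;
}
```

Under this model both sequences produce the same doubleword splat for every constant in -16..15, which covers the 0, -1 and 12 cases shown above. Constants outside that range would need the power9 form or some other sequence.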
So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

This generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for GCC 9/10/11 it generates:

0000000000000110 <__test_splatudi_12_V1>:
 110:   00 00 4c 3c     addis   r2,r12,0
                        110: R_PPC64_REL16_HA   .TOC.
 114:   00 00 42 38     addi    r2,r2,0
                        114: R_PPC64_REL16_LO   .TOC.+0x4
 118:   00 00 22 3d     addis   r9,r2,0
                        118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 11c:   00 00 29 39     addi    r9,r9,0
                        11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 120:   ce 48 40 7c     lvx     v2,0,r9
 124:   20 00 80 4e     blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be appropriately clever for power9!

I have tried many permutations of this, and the only way I have found to prevent this cleverness (GCC 9/10/11) is to use inline __asm (which has other bad side effects).
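For reference, the inline __asm workaround mentioned above might look roughly like the sketch below. This is an assumption about its shape, not the code actually used: the typedef vi64_sketch_t, the function names, and the guard on __POWER8_VECTOR__ are all chosen for illustration, and the non-PowerPC branch exists only so the sketch compiles on other targets.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#if defined(__powerpc64__) && defined(__POWER8_VECTOR__)
/* Hypothetical stand-in for the vi64_t type used in the report. */
typedef __vector long long vi64_sketch_t;

/* Emit the two-instruction sequence directly, so the compiler cannot
   fold it into a constant that is then loaded from .rodata. */
static inline vi64_sketch_t splat_dw_12_asm(void)
{
    vi64_sketch_t result;
    __asm__ ("vspltisw %0,12\n\t"
             "vupkhsw %0,%0"
             : "=v" (result));
    return result;
}
#endif

/* Hypothetical helper: store both doublewords of the splat in out[]. */
void splat_dw_12(int64_t out[2])
{
    assert(out != NULL);
#if defined(__powerpc64__) && defined(__POWER8_VECTOR__)
    vi64_sketch_t v = splat_dw_12_asm();
    memcpy(out, &v, sizeof v);
#else
    /* Portable fallback, only so the sketch builds on other targets. */
    out[0] = 12;
    out[1] = 12;
#endif
}
```

The compiler treats the asm body as opaque, so it can no longer propagate the constant, combine identical splats, or schedule the two instructions independently. That is presumably part of what makes this workaround unattractive.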