https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480

            Bug ID: 118480
           Summary: Power9 target generates poor code for vector char splat
                    immediate.
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

POWER9 (PowerISA 3.0C) adds the VSX Vector Splat Immediate Byte (xxspltib)
instruction, which is perfect for generating small integer constants for
vector char values.

GCC sometimes generates xxspltib, but at other times it inexplicably generates
a one- or two-instruction original Altivec (PowerISA 2.03) sequence, OR places
a vector constant in .rodata and generates code to load the vector.

For example, generate a vector char of 15's and use that as a shift count for
a shift left quadword by 15 bits.

vui8_t
test_splat7_char_15_V1 ()
{
  return vec_splats((unsigned char)15);
}

test_splat7_char_15_V1:
        xxspltib 34,15
        blr

vui128_t
test_slqi_char_15_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)15);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_15_V1:
        vspltisb 0,15
        vslo 2,2,0
        vsl 2,2,0
        blr

Note that a standalone vec_splats((unsigned char)15) generates:
        xxspltib 34,15

But passing the splatted 15 vector to vec_slo/vec_sll (shift left long
(quadword) 15) generated:
        vspltisb 0,15
        vslo 2,2,0
        vsl 2,2,0

Why/how the xxspltib was converted to vspltisb is not clear. For this specific
value (15) this is OK: vspltisb can handle 15 (5-bit SIM) as well as xxspltib
can (8-bit IMM8). But it is a bit strange.

Now let's look at some cases where the required (unsigned) constant does not
fit a 5-bit SIM field but fits nicely in the POWER9 xxspltib 8-bit immediate
field. For example:

vui8_t
test_splat7_char_18 ()
{
  return vec_splats((unsigned char)18);
}

test_splat7_char_18:
        xxspltib 34,9
        vaddubm 2,2,2
        blr

The compiler generates the xxspltib but does not believe that 18 fits into the
immediate field. That is true for vspltisb but not for xxspltib.
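
The difference between the two immediate fields can be sketched in portable C. This is only an illustration of the encodings described above, not GCC's actual constant-splitting logic; the helper names (byte_fits_vspltisb, byte_fits_xxspltib, check_splat_examples) are made up for this note.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* vspltisb: 5-bit signed immediate (SIM), sign-extended into each byte.
   A byte value is reachable only if, read as signed, it lies in -16..15,
   i.e. unsigned 0..15 or 240..255. */
static bool byte_fits_vspltisb(uint8_t byte)
{
    return byte <= 15 || byte >= 240;
}

/* xxspltib: 8-bit immediate (IMM8) copied into each byte, so every
   byte value is reachable with a single instruction. */
static bool byte_fits_xxspltib(uint8_t byte)
{
    (void) byte;
    return true;
}

/* Hypothetical self-check using the constants from this report. */
static void check_splat_examples(void)
{
    assert(byte_fits_vspltisb(15));    /* 15 fits the 5-bit SIM       */
    assert(!byte_fits_vspltisb(18));   /* 18 does not fit the SIM ... */
    assert(byte_fits_xxspltib(18));    /* ... but fits the IMM8       */
    assert(!byte_fits_vspltisb(116));
    assert(byte_fits_xxspltib(116));
}
```

Under these assumptions, 18 and 116 each need only a single xxspltib on POWER9, which is the sequence this report expects.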

Now use this constant in a shift left quadword. For example:

vui128_t
test_slqi_char_18_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)18);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_18_V3:
        .cfi_startproc
        vspltisb 0,9
        vadduwm 0,0,0
        vslo 2,2,0
        vsl 2,2,0
        blr

Again we see the conversion from 18 to (9 * 2). Not incorrect, but not optimal.
For P9 the dependent sequence xxspltib/vslo/vsl would have 9 cycles of latency.
The sequence above takes 12 cycles.

Now we will look at some larger shift counts, for example 116.

Note: A quadword shift requires a 7-bit shift count (bits 121:124 for vslo/vsro
and bits 125:127 for vsl/vsr). The 3-bit shift count for vsl/vsr must be
splatted across all 16 bytes. So it is simpler to generate a 7-bit shift count
splatted across the bytes and use that for both.

For example:

vui8_t
test_splat1_char_116_V2 ()
{
  return vec_splats ((unsigned char)116);
}

test_splat1_char_116_V2:
        xxspltib 34,116
        blr

Good, the compiler generated a single xxspltib. Excellent!

And:

vui8_t
test_slqi_char_116_V3 (vui8_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)116);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V3:
        addis 9,2,.LC15@toc@ha
        addi 9,9,.LC15@toc@l
        lxv 32,0(9)
        vslo 2,2,0
        vsl 2,2,0
        blr

What happened here? It could (should) have been the xxspltib/vslo/vsl sequence,
but the compiler went out of its way to generate a vector constant in .rodata
and load it from storage. This is (9+6=15) cycles minimum (L1 cache hit) as
generated.

We would do better using the POWER8 code sequence. For example:

vui8_t
test_slqi_char_116_V0 (vui8_t vra)
{
  vui8_t result;
  // 116-128 = -12
  vui8_t tmp = vec_splat_u8(-12);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V0:
        vspltisb 0,-12
        vslo 2,2,0
        vsl 2,2,0
        blr

This works because the lower 7 bits of -12 (0b11110100) are 0b1110100 == 116
(the processor ignores the high-order bit!).
The vspltisb/vslo/vsl sequence is (3+3+3=9) cycles minimum as generated for
POWER9.

This trick works for shift counts 0-15 and 112-127 (-16 to -1) but gets more
complicated for the range 16-111, which requires 2-5 instructions to generate
the 7-bit shift count for POWER8.

For POWER9 it is always better to generate an xxspltib for vector (unsigned)
char splats (vec_splat_u8() / vec_splats()) and for quadword shift counts.
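
The shift-count arithmetic in this report can be checked with a small portable C sketch. It only models the bit fields described in the text (the high-order bit of the byte is ignored, vslo/vsro use the next 4 bits, vsl/vsr use the low 3 bits); all function names are hypothetical, and no Power intrinsics are involved.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* The quadword shift count is the low 7 bits of the splatted byte. */
static unsigned quadword_shift_count(uint8_t splat_byte)
{
    return splat_byte & 0x7F;
}

/* vslo/vsro: upper 4 bits of the 7-bit count, a shift in whole bytes. */
static unsigned octet_shift(uint8_t splat_byte)
{
    return (splat_byte >> 3) & 0xF;
}

/* vsl/vsr: low 3 bits of the count, a shift in bits (0..7). */
static unsigned bit_shift(uint8_t splat_byte)
{
    return splat_byte & 0x7;
}

/* True when one vspltisb (SIM in -16..15) yields this 7-bit count:
   0..15 directly, 112..127 as count - 128. */
static bool count_fits_single_vspltisb(unsigned count)
{
    return count <= 15 || (count >= 112 && count <= 127);
}

/* The SIM value to use; only meaningful when the check above is true. */
static int vspltisb_immediate_for(unsigned count)
{
    return count <= 15 ? (int) count : (int) count - 128;
}

/* Hypothetical self-check with the values used in this report. */
static void check_shift_count_examples(void)
{
    uint8_t minus12 = (uint8_t) -12;            /* 0b11110100 */
    assert(minus12 == 0xF4);
    assert(quadword_shift_count(minus12) == 116);
    assert(octet_shift(116) == 14 && bit_shift(116) == 4);
    assert(octet_shift(116) * 8 + bit_shift(116) == 116);
    assert(count_fits_single_vspltisb(116));
    assert(vspltisb_immediate_for(116) == -12);
    assert(!count_fits_single_vspltisb(18));    /* needs 2+ insns on P8 */
}
```

In this model, 116 splits into a 14-byte shift plus a 4-bit shift, and -12 is the single vspltisb immediate that produces it, while 18 falls in the 16-111 range that needs a longer POWER8 sequence.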