https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818
Bug ID: 117818 Summary: vec_add incorrectly generates vadduwm for vector char const inputs. Product: gcc Version: 13.3.1 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: munroesj at gcc dot gnu.org Target Milestone: --- Created attachment 59731 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59731&action=edit Test case Vector shift long with const shift count -mcpu=power8 More specifically when the input to vec_add is generated by vec_splat_u8() and the result of the vec_add is passed into the shift count of some combination of vec_slo()/vec_sll() the compiler may generate vadduwm instead of vaddubm. This is very odd given that both vec_slo()/vec_sll() require a vector (un)signed char for the shift count (VRB). The combination vec_slo()/vec_sll() supports a 0-127 bit quadword left shift, while vec_sro()/vec_srl() supports a 0-127 bit quadword right shift. This requires a 7-bit shift count. Intrinsics vec_slo()/vec_sro() requires the high 4-bits of the shift count in VR bits 121:124. Intrinsics vec_sll()/vec_srl() requires the low 3-bits of the shift count in VR bits 125:127. However the PowerISA instructions vsl/vsr (corresponding to vec_sll()/vec_srl()) require that the shift count be splatted across all 16 bytes of the VRB. >From the PowerISA descriptions of vsl/vsr: The result is placed into VRT, except if, for any byte element in register VRB, the low-order 3 bits are not equal to the shift amount, then VRT is undefined. So it makes sense that intrinsics vec_slo()/vec_sll()/vec_sro()/vec_srl() require type vector char for VRB (shift count). It also makes sense that const shift counts would be generated by vec_splat_u8() and possibly some other intrinsic operations to extent the 5-bit SIM range (-16 to +15) to cover the 7-bit quadword shift counts required. Note: Loading a const vector from .rodata is extensive compared to short sequences of vector splat immediate and 0 to 2 additional operations. For example : vui8_t test_splat6_char_18 () { vui8_t tmp = vec_splat_u8(9); return vec_add (tmp, tmp); } You expect the compiler to generate: vspltisb 2,9 vaddubm 2,2,2 And it does. The quadword shift left by 18 example: vui128_t test_slqi_char_18_V1 (vui128_t vra) { vui8_t result; vui8_t tmp = vec_splat_u8(9); // 9 << 1 == 18 tmp = vec_add (tmp, tmp); result = vec_slo ((vui8_t) vra, tmp); return (vui128_t) vec_sll (result, tmp); } Actually generates: vspltisb 0,9 vadduwm 0,0,0 vslo 2,2,0 vsl 2,2,0 Note that the expected Vector Add Unsigned Byte Modulo (vaddubm) instruction has been replaced by Vector Add Unsigned Word Modulo (vadduwm). Technically the results are not incorrect for this specific value (no carries into adjacent bytes). But this fail for any of negative values (-16 to -1). Negative SIM values are required to cover the high range of the 7-bit shift count. For example a shift count of 110. vui8_t test_splat6_char_110 () { // 110-128 = -18 vui8_t tmp = vec_splat_u8(-9); return vec_add (tmp, tmp); } Add the compiler does generate the expected: vspltisb 2,-9 vaddubm 2,2,2 Good so far. Now try to shift long by 110 bits: vui128_t test_slqi_char_110_V1 (vui128_t vra) { vui8_t result; vui8_t tmp = vec_splat_u8(-9); tmp = vec_add (tmp, tmp); result = vec_slo ((vui8_t) vra, tmp); return (vui128_t) vec_sll (result, tmp); } A reasonable person would expect the compiler to generate: vspltisb 0,-9 vaddubm 0,0,0 vslo 2,2,0 vsl 2,2,0 And as recently as GCC 7 it did exactly that. But starting with GCC 8 we get the tricky word add for positive values and performance sucking .rodata load for any negative value. For example GCC 13 generates: addis 9,2,.LC0@toc@ha addi 9,9,.LC0@toc@l lvx 0,0,9 vslo 2,2,0 vsl 2,2,0 where the .rodata const .LC0 is: .align 4 .LC0: .long -286331154 .long -286331154 .long -286331154 .long -286331154 This is correct but unnecessary. It is odd that the .rodata is defined as 4 words (.long) and not 16 bytes (.byte) given that original const is vector char. So this is a bug and a regression starting with GCC 8