https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

            Bug ID: 117818
           Summary: vec_add incorrectly generates vadduwm for vector char
                    const inputs.
           Product: gcc
           Version: 13.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59731
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59731&action=edit
Test case Vector shift long with const shift count -mcpu=power8

More specifically, when the input to vec_add is generated by vec_splat_u8() and
the result of the vec_add is passed as the shift count to some combination of
vec_slo()/vec_sll(), the compiler may generate vadduwm instead of vaddubm. This
is very odd given that both vec_slo() and vec_sll() require a vector (un)signed
char for the shift count (VRB).

The combination vec_slo()/vec_sll() supports a 0-127 bit quadword left shift,
while vec_sro()/vec_srl() supports a 0-127 bit quadword right shift. This
requires a 7-bit shift count. The intrinsics vec_slo()/vec_sro() require the
high 4 bits of the shift count in VR bits 121:124, while vec_sll()/vec_srl()
require the low 3 bits of the shift count in VR bits 125:127. However, the
PowerISA instructions vsl/vsr (corresponding to vec_sll()/vec_srl()) require
that the shift count be splatted across all 16 bytes of VRB.

From the PowerISA descriptions of vsl/vsr:

The result is placed into VRT, except if, for any byte
element in register VRB, the low-order 3 bits are not
equal to the shift amount, then VRT is undefined.

So it makes sense that intrinsics vec_slo()/vec_sll()/vec_sro()/vec_srl()
require type vector char for VRB (shift count).
It also makes sense that const shift counts would be generated by
vec_splat_u8(), and possibly some other intrinsic operations, to extend the
5-bit SIM range (-16 to +15) to cover the 7-bit quadword shift counts required.
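For illustration, a variable (non-constant) shift count has to be splatted into
every byte before it is fed to these intrinsics. A minimal sketch of that
pattern, not taken from the attached test case, using the raw AltiVec types:

#include <altivec.h>

/* Sketch only: quadword shift left by a variable 0-127 bit count.
   The same splatted count feeds vec_slo (which uses VR bits 121:124)
   and vec_sll (which uses VR bits 125:127); vsl requires the count to
   be identical in every byte of VRB.  */
__vector unsigned char
slq_var (__vector unsigned char vra, unsigned int shb)
{
  __vector unsigned char vshb = vec_splats ((unsigned char) shb);
  __vector unsigned char result = vec_slo (vra, vshb);
  return vec_sll (result, vshb);
}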

Note: Loading a const vector from .rodata is expensive compared to a short
sequence of vector splat immediate and 0 to 2 additional operations.
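
The examples below use the vector typedefs from the attached test case;
presumably something like:

#include <altivec.h>

typedef __vector unsigned char vui8_t;        /* assumed */
typedef __vector unsigned __int128 vui128_t;  /* assumed */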

For example:

vui8_t test_splat6_char_18 ()
{
  vui8_t tmp = vec_splat_u8(9);
  return vec_add (tmp, tmp);
}

You expect the compiler to generate:
        vspltisb 2,9
        vaddubm 2,2,2

And it does.

The quadword shift left by 18 example:

vui128_t test_slqi_char_18_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(9);
  // 9 << 1 == 18
  tmp = vec_add (tmp, tmp);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_sll (result, tmp);
}

Actually generates:

        vspltisb 0,9
        vadduwm 0,0,0
        vslo 2,2,0
        vsl 2,2,0

Note that the expected Vector Add Unsigned Byte Modulo (vaddubm) instruction
has been replaced by Vector Add Unsigned Word Modulo (vadduwm).

Technically the results are not incorrect for this specific value (no carries
into adjacent bytes). But this fails for any negative value (-16 to -1).
Negative SIM values are required to cover the high range of the 7-bit shift
count.
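
A scalar illustration (hypothetical, not from the attachment) of why the word
add only happens to work when no carries cross byte boundaries:

#include <stdio.h>
#include <stdint.h>

int main (void)
{
  /* -9 (0xF7) splatted into the 4 bytes of a word.  */
  uint32_t splat = 0xF7F7F7F7u;
  /* vadduwm-style 32-bit add: carries cross byte boundaries.  */
  uint32_t word_add = splat + splat;
  /* vaddubm-style add: each byte wraps modulo 256 independently.  */
  uint32_t byte_add = 0;
  for (int i = 0; i < 32; i += 8)
    byte_add |= ((((splat >> i) & 0xFFu) * 2u) & 0xFFu) << i;
  /* Prints: word add EFEFEFEE vs byte add EEEEEEEE  */
  printf ("word add %08X vs byte add %08X\n",
          (unsigned) word_add, (unsigned) byte_add);
  return 0;
}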

For example, a shift count of 110:

vui8_t
test_splat6_char_110 ()
{ // 110 - 128 = -18, i.e. 2 * -9
  vui8_t tmp = vec_splat_u8(-9);
  return vec_add (tmp, tmp);
}

And the compiler does generate the expected:
        vspltisb 2,-9
        vaddubm 2,2,2

Good so far. Now try to shift long by 110 bits:

vui128_t
test_slqi_char_110_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_add (tmp, tmp);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_sll (result, tmp);
}

A reasonable person would expect the compiler to generate:

        vspltisb 0,-9
        vaddubm 0,0,0
        vslo 2,2,0
        vsl 2,2,0

And as recently as GCC 7 it did exactly that. But starting with GCC 8 we get
the tricky word add for positive values and a performance-sucking .rodata load
for any negative value.

For example GCC 13 generates:

        addis 9,2,.LC0@toc@ha
        addi 9,9,.LC0@toc@l
        lvx 0,0,9
        vslo 2,2,0
        vsl 2,2,0

where the .rodata const .LC0 is:

        .align 4
.LC0:
        .long   -286331154
        .long   -286331154
        .long   -286331154
        .long   -286331154

This is correct but unnecessary. Note that -286331154 is 0xEEEEEEEE, i.e. the
expected splat byte 0xEE (-18) repeated. It is also odd that the .rodata is
defined as 4 words (.long) and not 16 bytes (.byte) given that the original
const is vector char.

So this is a bug and a regression starting with GCC 8.
