[Bug target/118480] New: Power9 target generates poor code for vector char splat immediate.

munroesj at gcc dot gnu.org via Gcc-bugs Tue, 14 Jan 2025 11:51:03 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480


            Bug ID: 118480
           Summary: Power9 target generates poor code for vector char
                    splat immediate.
           Product: gcc
           Version: 13.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

POWER9 (PowerISA 3.0C) adds the VSX Vector Splat Immediate Byte (xxspltib)
instruction, which is perfect for generating small integer constants for vector
char values. GCC sometimes generates xxspltib, but at other times it
inexplicably generates a one- or two-instruction original Altivec (PowerISA 2.03)
sequence, or places a vector constant in .rodata and generates code to load it.

For example, generate a vector char of 15s and use that as the shift count for
a quadword shift left by 15 bits.


vui8_t
test_splat7_char_15_V1 ()
{
  return vec_splats((unsigned char)15);
}

test_splat7_char_15_V1:
        xxspltib 34,15
        blr

vui128_t
test_slqi_char_15_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)15);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_15_V1:
        vspltisb 0,15
        vslo 2,2,0
        vsl 2,2,0
        blr

Note that a standalone vec_splats((unsigned char)15) generates:

        xxspltib 34,15

But passing the splatted 15 vector to vec_slo/vec_vsl (shift left long
(quadword) by 15) generates:

        vspltisb 0,15
        vslo 2,2,0
        vsl 2,2,0

Why/how the xxspltib was converted to vspltisb is not clear. For this specific
value (15) this is OK: vspltisb can handle 15 (5-bit SIM) just as well as
xxspltib (8-bit IMM8).

But it is a bit strange.
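
The difference between the two immediate fields can be sketched in portable C.
This is only a model of the encodings, not compiler code; the function names
sim5_to_byte and fits_vspltisb are hypothetical and chosen for illustration.
It assumes the 5-bit SIM field is sign-extended to a byte, so vspltisb reaches
bytes 0x00-0x0F and 0xF0-0xFF, while the 8-bit xxspltib immediate reaches
every byte value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte that vspltisb would splat for a 5-bit signed immediate (SIM).
   Hypothetical helper: models sign extension to 8 bits only. */
uint8_t sim5_to_byte(int sim)
{
    assert(sim >= -16 && sim <= 15);  /* range of a 5-bit signed field */
    return (uint8_t)sim;              /* e.g. -12 becomes 0xF4 */
}

/* True if a single vspltisb can produce this exact byte value.
   xxspltib needs no such check: its IMM8 field covers 0..255. */
bool fits_vspltisb(uint8_t value)
{
    return value <= 0x0F || value >= 0xF0;
}
```

Under this model fits_vspltisb(15) is true, while fits_vspltisb(18) and
fits_vspltisb(116) are false, which matches the cases examined in this report.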

Now let's look at some cases where the required (unsigned) constant does not fit
a 5-bit SIM field but fits nicely in the POWER9 xxspltib 8-bit immediate field.
For example:

vui8_t
test_splat7_char_18 ()
{
  return vec_splats((unsigned char)18);
}

test_splat7_char_18:
        xxspltib 34,9
        vaddubm 2,2,2
        blr

The compiler generates the xxspltib but does not believe that 18 fits into
the immediate field. That is true for vspltisb but not for xxspltib. Now use this
constant in a shift left quadword. For example:

vui128_t
test_slqi_char_18_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)18);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_18_V3:
        .cfi_startproc
        vspltisb 0,9
        vadduwm 0,0,0
        vslo 2,2,0
        vsl 2,2,0
        blr

Again we see the conversion from 18 to (9 * 2). Not incorrect but not optimal.
For P9 the dependent sequence xxspltib/vslo/vsl would be 9 cycles latency. The
sequence above is 12 cycles.

Now we will look at some larger shift counts, for example 116.

Note: A quadword shift requires a 7-bit shift count (bits 121:124 for vslo/vsro
and bits 125:127 for vsl/vsr). The 3-bit shift count for vsl/vsr must be
splatted across all 16 bytes. So it is simpler to generate a 7-bit shift count
splatted across the bytes and use that for both.
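
As a rough illustration of why one splatted byte can feed both instructions,
the sketch below models the quadword as two 64-bit halves in portable C. The
type u128_model and the functions model_vslo, model_vsl and model_slq are
hypothetical names, not intrinsics; the model only assumes that vslo takes its
octet count from bits 121:124 and vsl takes its bit count from bits 125:127,
which are bits 6..3 and 2..0 of the low-order byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value: hi holds bits 0:63, lo holds bits 64:127. */
typedef struct { uint64_t hi, lo; } u128_model;

/* Shift a 128-bit value left by n bits, 0 <= n < 128. */
static u128_model shl128(u128_model v, unsigned n)
{
    u128_model r;
    assert(n < 128);
    if (n == 0)
        return v;
    if (n >= 64) {
        r.hi = v.lo << (n - 64);
        r.lo = 0;
    } else {
        r.hi = (v.hi << n) | (v.lo >> (64 - n));
        r.lo = v.lo << n;
    }
    return r;
}

/* Model of vslo: shift left by whole octets, count in bits 121:124. */
u128_model model_vslo(u128_model v, uint8_t splat)
{
    return shl128(v, ((splat >> 3) & 0xFu) * 8u);
}

/* Model of vsl: shift left by 0..7 bits, count in bits 125:127. */
u128_model model_vsl(u128_model v, uint8_t splat)
{
    return shl128(v, splat & 0x7u);
}

/* vslo followed by vsl with the same splatted byte shifts by (splat & 0x7F). */
u128_model model_slq(u128_model v, uint8_t splat)
{
    return model_vsl(model_vslo(v, splat), splat);
}
```

In this model a splat of 116 gives an octet count of 14 (112 bits) plus a bit
count of 4, for a total shift of 116 bits. The high-order bit of the byte is
never examined.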

For example:

vui8_t
test_splat1_char_116_V2 ()
{
  return vec_splats ((unsigned char)116);
}

test_splat1_char_116_V2:
        xxspltib 34,116
        blr

Good, the compiler generated a single xxspltib. Excellent! And:

vui8_t
test_slqi_char_116_V3 (vui8_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)116);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V3:
        addis 9,2,.LC15@toc@ha
        addi 9,9,.LC15@toc@l
        lxv 32,0(9)
        vslo 2,2,0
        vsl 2,2,0
        blr

What happened here? It could (should) have been the xxspltib/vslo/vsl sequence,
but the compiler went out of its way to generate a vector constant in .rodata
and load it from storage. This is (9+6=15) cycles minimum (L1 cache hit) as
generated.

We would do better using the POWER8 code sequence. For example:

vui8_t
test_slqi_char_116_V0 (vui8_t vra)
{
  vui8_t result;
   // 116-128 = -12
  vui8_t tmp = vec_splat_u8(-12);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V0:
        vspltisb 0,-12
        vslo 2,2,0
        vsl 2,2,0
        blr

This works because the low 7 bits of -12 (0b11110100) are 0b1110100 == 116
(the processor ignores the high-order bit!). This is (3+3+3=9) cycles minimum
as generated for POWER9.

This trick works for 0-15 and 112-127 (-16 to -1) but gets more complicated for
the range 16-111 which requires 2-5 instructions to generate 7-bit shift counts
for POWER8.
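
The ranges above can be checked with a small portable C sketch. The function
name vspltisb_sim_for_count is hypothetical; it only models which 7-bit shift
counts a single vspltisb can supply when the high-order bit of each byte is
ignored, and it does not attempt the multi-instruction sequences needed for
16-111.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* For a 7-bit quadword shift count, report whether one vspltisb is enough.
   On success *sim receives the 5-bit signed immediate to use. */
bool vspltisb_sim_for_count(unsigned count, int *sim)
{
    assert(count < 128);
    if (count <= 15) {
        *sim = (int)count;          /* 0..15 encode directly */
    } else if (count >= 112) {
        *sim = (int)count - 128;    /* 112..127 map to -16..-1 */
    } else {
        return false;               /* 16..111 need more instructions */
    }
    /* Sanity check: the low 7 bits of the splatted byte equal the count. */
    assert(((uint8_t)*sim & 0x7Fu) == count);
    return true;
}
```

For a count of 116 this returns true with an immediate of -12, the value used
in test_slqi_char_116_V0. For 18 it returns false.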

For POWER9 it is always better to generate an xxspltib for vector (unsigned)
char splat (vec_splat_u8() / vec_splats()) and quadword shift counts.
