[Bug target/118480] Power9 target generates poor code for vector char splat immediate.

munroesj at gcc dot gnu.org via Gcc-bugs Tue, 14 Jan 2025 11:52:50 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480


--- Comment #1 from Steven Munroe <munroesj at gcc dot gnu.org> ---

Strangely, the tricks that seem to work for positive immediate values (see
test_slqi_char_18_V3 above) fail (generate a .rodata load) for negative
values. For example, take the shift count 110 (110-128 = -18). The splat on
its own is fine:


vui8_t
test_splat1_char_110_V2 ()
{
  return vec_splats ((unsigned char)110);
}

test_splat1_char_110_V2:
        xxspltib 34,110
        blr

But it fails when the vec_splats result is passed to vec_slo/vec_sll:

vui128_t
test_slqi_char_110_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)110);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_110_V3:
        addis 9,2,.LC9@toc@ha
        addi 9,9,.LC9@toc@l
        lxv 32,0(9)
        vslo 2,2,0
        vsl 2,2,0
        blr

Strangely, GCC plays along with the even (but negative) numbers trick. For
example:

vui8_t
test_splat7_char_110_V0 ()
{ // 110-128 = -18
  // (-18 / 2) + (-18 / 2)
  // (-9) + (-9)
  vui8_t tmp = vec_splat_u8(-9);
  return vec_add (tmp, tmp);
}

test_splat7_char_110_V0:
        xxspltib 34,247
        vaddubm 2,2,2
        blr
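
The arithmetic behind this trick can be checked with a scalar model of a
single byte lane. The sketch below is an illustration in portable C, not the
vector intrinsics themselves, and the function names are hypothetical. It
assumes that vslo reads a 4-bit octet count and vsl a 3-bit bit count, so only
the low 7 bits of the count byte matter:

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of one byte lane; names are hypothetical. */

/* vec_splat_u8(imm): each lane holds the low 8 bits of a 5-bit signed
   immediate, so -9 becomes 0xF7 (247), matching "xxspltib 34,247". */
uint8_t splat_lane(int imm)
{
    assert(imm >= -16 && imm <= 15);
    return (uint8_t)imm;
}

/* vaddubm: per-lane add, modulo 256. */
uint8_t add_lane(uint8_t a, uint8_t b)
{
    return (uint8_t)(a + b);
}

/* The 7-bit shift count a vslo/vsl pair would take from this lane. */
unsigned shift_count_7bit(uint8_t lane)
{
    return lane & 0x7Fu;
}
```

In this model (-9) + (-9) gives the lane value 238, and 238 & 0x7F is 110, the
intended shift count.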

But the trick fails when this value is passed to vec_slo/vec_sll:

vui128_t
test_slqi_char_110_V2 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_vaddubm (tmp, tmp);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_110_V2:
        addis 9,2,.LC11@toc@ha
        addi 9,9,.LC11@toc@l
        lxv 32,0(9)
        vslo 2,2,0
        vsl 2,2,0
        blr

Stranger yet, replacing the vaddubm with a shift left by 1 still produces
xxspltib/vaddubm:

vui8_t
test_splat7_char__110_V4 ()
{ // 110 - 128 = -18 
  // -18 = (-9 * 2) = (-9 << 1)
  vui8_t v1 = vec_splat_u8(1);
  vui8_t tmp = vec_splat_u8(-9);
  return vec_sl (tmp, v1);
}

test_splat7_char__110_V4:
        xxspltib 34,247
        vaddubm 2,2,2
        blr

When this is passed to vec_slo/vec_sll, GCC avoids the conversion to .rodata,
but converts the shift back to xxspltib/vaddubm. This is slightly better but
generates an extra (and unnecessary) instruction:

vui8_t
test_slqi_char_110_V4 (vui8_t vra)
{
  vui8_t result;
  // 110 - 128 = -18 = (-9 * 2) = (-9 << 1)
  vui8_t v1 = vec_splat_u8(1);
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_sl (tmp, v1);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_110_V4:
        xxspltib 32,247
        vaddubm 0,0,0
        vslo 2,2,0
        vsl 2,2,0
        blr
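
The conversion from the shift back to vaddubm is at least consistent with the
two operations being equivalent per byte lane: shifting left by one and adding
a value to itself give the same result modulo 256. A minimal scalar sketch,
modelling a lane as uint8_t (the function names are hypothetical, and this says
nothing about why GCC chooses this form):

```c
#include <stdint.h>

/* Per-lane model of vec_sl (x, splat(1)): shift left by one, modulo 256. */
uint8_t lane_shl1(uint8_t x)
{
    return (uint8_t)(x << 1);
}

/* Per-lane model of "vaddubm x,x": add to itself, modulo 256. */
uint8_t lane_add_self(uint8_t x)
{
    return (uint8_t)(x + x);
}

/* Returns 1 if both models agree for all 256 possible lane values. */
int shl1_equals_add_self(void)
{
    for (unsigned v = 0; v < 256; v++)
        if (lane_shl1((uint8_t)v) != lane_add_self((uint8_t)v))
            return 0;
    return 1;
}
```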

Perhaps we are on to something!
- Avoid negative values
- Use explicit shift instead of add

So one last example, generating the 7-bit shift count as an octet count
(times 8) plus a bit count, and using only positive values:

vui8_t
test_splat7_char_110_V1 ()
{
  // 110 = (13 * 8) + 6
  vui8_t v3 = vec_splat_u8(3);
  vui8_t tmp = vec_splat_u8(13);
  vui8_t tmp2 = vec_splat_u8(6);
  tmp = vec_sl (tmp, v3);
  return vec_add (tmp, tmp2);
}

test_splat7_char_110_V1:
        xxspltib 34,110
        blr

And:

vui8_t
test_slqi_char_110_V5 (vui8_t vra)
{
  vui8_t result;
  // 110 = (13 * 8) + 6
  vui8_t v3 = vec_splat_u8(3);
  vui8_t tmp = vec_splat_u8(13);
  vui8_t tmp2 = vec_splat_u8(6);
  tmp = vec_sl (tmp, v3);
  tmp = vec_add (tmp, tmp2);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_110_V5:
        xxspltib 32,110
        vslo 2,2,0
        vsl 2,2,0
        blr

Finally we have a reasonable result, one that should have been possible with a
simple vec_splats((unsigned char)110)!

Note: this looks like a possible workaround for generating vectors splatted
with positive constants. The problem with negative constants appears to
persist.
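
The count decomposition used in the last example (110 = 13 * 8 + 6) and the
two-step quadword shift can be checked with a portable scalar model. This is a
sketch only: the u128_model type and all function names are hypothetical, and
the model assumes that vslo consumes the 4-bit octet part of the count and vsl
the low 3 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value held as two 64-bit halves. */
typedef struct { uint64_t hi, lo; } u128_model;

/* Octet part of the count: bits 3..6, the part vslo is assumed to read. */
unsigned octet_count(unsigned count) { return (count >> 3) & 0xFu; }

/* Bit part of the count: the low 3 bits, the part vsl is assumed to read. */
unsigned bit_count(unsigned count) { return count & 0x7u; }

/* Shift a 128-bit value left by n bits, 0 <= n < 128. */
u128_model shl128(u128_model v, unsigned n)
{
    u128_model r;
    assert(n < 128);
    if (n == 0)
        return v;
    if (n >= 64) {
        r.hi = v.lo << (n - 64);
        r.lo = 0;
    } else {
        r.hi = (v.hi << n) | (v.lo >> (64 - n));
        r.lo = v.lo << n;
    }
    return r;
}

/* Model of vslo followed by vsl: whole octets first, then leftover bits. */
u128_model shl128_two_step(u128_model v, unsigned count)
{
    u128_model t = shl128(v, octet_count(count) * 8);
    return shl128(t, bit_count(count));
}
```

For a count of 110 the model splits into 13 octets plus 6 bits, and the
two-step shift agrees with a single 110-bit shift. The count 238 (-18 as an
unsigned char) splits the same way, since bit 7 is ignored.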
