[Bug target/104124] New: Poor optimization for vector splat DW with small consts

munroesj at gcc dot gnu.org via Gcc-bugs Wed, 19 Jan 2022 09:40:41 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124


            Bug ID: 104124
           Summary: Poor optimization for vector splat DW with small
                    consts
           Product: gcc
           Version: 11.1.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading
all the vector long long constants I need in my code. This is leaf code, small
enough to run using only volatile registers (no stack frame). But this puts
more pressure on volatile VRs, VSRs, and GPRs, especially GPRs, because it is
loading from .rodata when it could (and should) use a vector immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

This generates:
00000000000001a0 <__test_splatudi_0_V0>:
     1a0:       8c 03 40 10     vspltisw v2,0
     1a4:       20 00 80 4e     blr

00000000000001c0 <__test_splatudi_1_V0>:
     1c0:       8c 03 5f 10     vspltisw v2,-1
     1c4:       20 00 80 4e     blr
        ...

But other cases that could use immediates, like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 generates for power8:

0000000000000170 <__test_splatudi_12_V0>:
     170:       00 00 4c 3c     addis   r2,r12,0
                        170: R_PPC64_REL16_HA   .TOC.
     174:       00 00 42 38     addi    r2,r2,0
                        174: R_PPC64_REL16_LO   .TOC.+0x4
     178:       00 00 22 3d     addis   r9,r2,0
                        178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
     17c:       00 00 29 39     addi    r9,r9,0
                        17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
     180:       ce 48 40 7c     lvx     v2,0,r9
     184:       20 00 80 4e     blr

and for Power9:
0000000000000000 <__test_splatisd_12_PWR9>:
       0:       d1 62 40 f0     xxspltib vs34,12
       4:       02 16 58 10     vextsb2d v2,v2
       8:       20 00 80 4e     blr
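
As a rough illustration of why the two-instruction Power9 sequence is enough,
here is a minimal portable C model of it. It is only a sketch: the p9_* names
are hypothetical helpers invented for this example (they are not GCC or
AltiVec APIs), and the model assumes xxspltib copies an 8-bit immediate into
all 16 bytes and vextsb2d sign-extends the low-order byte of each doubleword.

```c
#include <stdint.h>
#include <string.h>

/* Model of xxspltib: copy one 8-bit immediate into all 16 bytes. */
static void p9_xxspltib(uint8_t out[16], uint8_t imm8)
{
    memset(out, imm8, 16);
}

/* Model of vextsb2d: sign-extend the low-order byte of each doubleword
   to 64 bits. Bytes 7 and 15 are the low-order bytes in big-endian
   element numbering; after a splat every byte is identical anyway. */
static void p9_vextsb2d(int64_t out[2], const uint8_t in[16])
{
    /* (x ^ 0x80) - 0x80 sign-extends an 8-bit value portably. */
    out[0] = (int64_t) ((in[7] ^ 0x80) - 0x80);
    out[1] = (int64_t) ((in[15] ^ 0x80) - 0x80);
}

/* Hypothetical helper: splat a small constant to both doublewords.
   Returns 1 on success, 0 if the value does not fit in a signed byte. */
static int p9_splat_dw(int64_t out[2], int value)
{
    uint8_t bytes[16];

    if (value < -128 || value > 127)
        return 0;
    p9_xxspltib(bytes, (uint8_t) value);
    p9_vextsb2d(out, bytes);
    return 1;
}
```

Under these assumptions, any constant in the range -128..127 can be splatted
to both doublewords without touching memory.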

So why can't the power8 target generate:

00000000000000f0 <__test_splatudi_12_V1>:
      f0:       8c 03 4c 10     vspltisw v2,12
      f4:       4e 16 40 10     vupkhsw v2,v2
      f8:       20 00 80 4e     blr

This is 4 cycles vs 9 (best case, and it is always 9 cycles because GCC does
not exploit immediate fusion).
In fact GCC 8 (AT12) does this.
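
For reference, the vspltisw/vupkhsw pair can be modeled in portable C to show
which constants it covers. This is a sketch under stated assumptions, not the
compiler's logic: the p8_* names are hypothetical helpers made up for this
example, vspltisw is modeled as splatting a sign-extended 5-bit immediate
(-16..15) into four words, and vupkhsw as sign-extending two of those words
to doublewords.

```c
#include <assert.h>
#include <stdint.h>

/* Model of vspltisw: sign-extend a 5-bit immediate (-16..15) and copy
   it into all four 32-bit words. */
static void p8_vspltisw(int32_t out[4], int simm5)
{
    assert(simm5 >= -16 && simm5 <= 15);
    for (int i = 0; i < 4; i++)
        out[i] = (int32_t) simm5;
}

/* Model of vupkhsw: sign-extend two 32-bit words to two 64-bit
   doublewords. After a splat all four words are equal, so which half
   of the source is unpacked does not change the result. */
static void p8_vupkhsw(int64_t out[2], const int32_t in[4])
{
    out[0] = (int64_t) in[0];
    out[1] = (int64_t) in[1];
}

/* Hypothetical helper: splat a small constant to both doublewords.
   Returns 1 on success, 0 if the value does not fit the immediate. */
static int p8_splat_dw(int64_t out[2], int value)
{
    int32_t words[4];

    if (value < -16 || value > 15)
        return 0;
    p8_vspltisw(words, value);
    p8_vupkhsw(out, words);
    return 1;
}
```

In this model, p8_splat_dw with 12 leaves 12 in both doublewords, and any
constant in -16..15 works the same way with no load from .rodata.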

So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

Which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for
GCC 9/10/11 it generates:

0000000000000110 <__test_splatudi_12_V1>:
     110:       00 00 4c 3c     addis   r2,r12,0
                        110: R_PPC64_REL16_HA   .TOC.
     114:       00 00 42 38     addi    r2,r2,0
                        114: R_PPC64_REL16_LO   .TOC.+0x4
     118:       00 00 22 3d     addis   r9,r2,0
                        118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
     11c:       00 00 29 39     addi    r9,r9,0
                        11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
     120:       ce 48 40 7c     lvx     v2,0,r9
     124:       20 00 80 4e     blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this and the only way I have found to prevent
this (GCC 9/10/11) cleverness is to use inline __asm (which has other bad side
effects).
