[Bug rtl-optimization/100085] New: Bad code for union transfer from __float128 to vector types

2021-04-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Bug ID: 100085
   Summary: Bad code for union transfer from __float128 to vector
types
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: rtl-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50595
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50595&action=edit
Reduced example of union and __float128 to vector transfer.

GCC 10/9/8/7 will generate poor code (-mcpu=power8) when using a union to
transfer a __float128 scalar to any vector type. __float128 is a scalar type
and not typecast compatible with any vector type, despite both living in
vector registers.

But for runtime code implementing __float128 operations for -mcpu=power8 it is
useful (and faster) to perform some operations (data_class, conversions, etc.)
directly in vector registers. The only solution for this is to use a union to
transfer values between __float128 and vector types. This should be a simple
vector register transfer and optimized as such.
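
A minimal sketch of such a union transfer (the union layout follows the
__VF_128 union used in comment #4 below; the vector member name vx4 is only
illustrative):

typedef union
{
  __float128            vf1;
  unsigned __int128     ui1;
  __vector unsigned int vx4;
} __VF_128;

static inline __vector unsigned int
vec_xfer_bin128_2_vui32t (__float128 f128)
{
  __VF_128 vunion;

  vunion.vf1 = f128;
  /* Ideally this compiles to a plain vector register transfer,
     not a store to the stack followed by a vector reload.  */
  return (vunion.vx4);
}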

But for GCC targeting PowerPC64LE with -mcpu=power8, we are consistently seeing
store/reload sequences. For POWER8 this can cause load-hit-store and pipeline
rejects (33 cycles).

We don't see this when targeting -mcpu=power9, but POWER9 supports hardware
__float128 instructions. We also don't see this when targeting BE.

[Bug rtl-optimization/100085] Bad code for union transfer from __float128 to vector types

2021-04-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe  changed:

   What|Removed |Added

 CC||munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe  ---
Created attachment 50596
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50596&action=edit
Compile test case for xfer operation.

Compile for PowerPC64LE for both -mcpu=power8 -mfloat128 and -mcpu=power9
-mfloat128 and see the different asm generated.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-04-16 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #4 from Steven Munroe  ---
I am seeing a similar problem with union transfers from __float128 to
__int128.


 static inline unsigned __int128
 vec_xfer_bin128_2_int128t (__binary128 f128)
 {
   __VF_128 vunion;

   vunion.vf1 = f128;

   return (vunion.ui1);
 }

and 

unsigned __int128
test_xfer_bin128_2_int128 (__binary128 f128)
{
  return vec_xfer_bin128_2_int128t (f128);
}

generates:

0030 <test_xfer_bin128_2_int128>:
  30:   57 12 42 f0 xxswapd vs34,vs34
  34:   20 00 20 39 li  r9,32
  38:   d0 ff 41 39 addi    r10,r1,-48
  3c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  40:   f0 ff 61 e8 ld  r3,-16(r1)
  44:   f8 ff 81 e8 ld  r4,-8(r1)
  48:   20 00 80 4e blr

For POWER8 this should use mfvsrd/xxpermdi/mfvsrd.
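
A rough, untested sketch of that direct-move sequence (the inline-asm
constraints, the use of xxpermdi to swap doublewords in place, and the
doubleword ordering assumed for the __int128 result are all assumptions, not
the actual fix):

static inline unsigned __int128
xfer_bin128_2_int128_dm (__binary128 f128)
{
  unsigned long long hi, lo;
  /* mfvsrd reads doubleword 0 of the VSR; xxpermdi then swaps the
     doublewords in place so the second mfvsrd can read the other half.  */
  __asm__ ("mfvsrd %0,%x2\n\t"
           "xxpermdi %x2,%x2,%x2,2\n\t"
           "mfvsrd %1,%x2"
           : "=&r" (hi), "=r" (lo), "+wa" (f128));
  return ((unsigned __int128) hi << 64) | lo;
}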

This looks like the root cause of poor performance for __float128 soft-float on
POWER8. A simple benchmark uses __float128 in C code, calling libgcc for
-mcpu=power8 and using hardware instructions for -mcpu=power9.

P8 target P8AT14, Uses libgcc __addkf3_sw and __mulkf3_sw:
test_time_f128 f128 CC  tb delta = 52589, sec = 0.000102713

P9 Target P8AT14, Uses libgcc __addkf3_hw and __mulkf3_hw:
test_time_f128 f128 CC  tb delta = 18762, sec = 3.66445e-05

P9 Target P9AT14, inline hardware binary128 float:
test_time_f128 f128 CC  tb delta = 3809, sec = 7.43945e-06

I used Valgrind itrace, sim-ppc, and perfstat analysis. Every call to libgcc
__add/sub/mul/divkf3 takes a load-hit-store flush. This explains why
__float128 is 13.8X slower on P8 than P9.

[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib

2021-01-04 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519

--- Comment #5 from Steven Munroe  ---
I would think you need to look at the instruction and the "m" constraint.

In this case lxsd%X1 would need to be converted to plxsd and the "m" constraint
would have to allow @pcrel. I would think a static variable would be valid, but
a stack local or an explicit pointer with a (non-constant) offset/index would not.

[Bug target/98519] rs6000: @pcrel unsupported on this instruction error in pveclib

2021-01-04 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98519

--- Comment #7 from Steven Munroe  ---
Then you have a problem, as @pcrel is never valid for an instruction like lxsd%X1.

Seems like you will need a new constraint or modifier specific to @pcrel.

[Bug middle-end/99293] New: Built-in vec_splat generates sub-optimal code for -mcpu=power10

2021-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293

Bug ID: 99293
   Summary: Built-in vec_splat generates sub-optimal code for
-mcpu=power10
   Product: gcc
   Version: 10.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 50263
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50263&action=edit
Simplified test case

While adding code to Power Vector Library (PVECLIB), for the POWER10 target, I
see strange code generation for the Altivec built-in vec_splat for the vector
long long type. I would expect an xxpermdi (xxspltd) based on the "Power Vector
Intrinsic Programming Reference".
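
A minimal sketch of the kind of source involved (the attachment has the full
reduced case; the vui64_t typedef for __vector unsigned long long is from
PVECLIB):

vui64_t
test_vec_splatd1 (vui64_t vra)
{
  /* Splat doubleword element 1; a single xxspltd (xxpermdi) should do.  */
  return vec_splat (vra, 1);
}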

But I see the following generated:

0300 :
 300:   67 02 69 7c mfvsrld r9,vs35
 304:   67 4b 09 7c mtvsrdd vs32,r9,r9
 308:   05 00 42 10 vrlq    v2,v2,v0
 30c:   20 00 80 4e blr

While this seems to be functionally correct, the trip through the GPRs seems
unnecessary. It requires two serially dependent instructions where a single
xxspltd would do. I expected:

0300 :
 300:   57 1b 63 f0 xxspltd vs35,vs35,1
 304:   05 18 42 10 vrlq    v2,v2,v3
 308:   20 00 80 4e blr


The compiler was:

Compiler: gcc version 10.2.1 20210104 (Advance-Toolchain 14.0-2) [2093e873bb6c]
(GCC)

[Bug middle-end/99293] Built-in vec_splat generates sub-optimal code for -mcpu=power10

2021-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99293

--- Comment #1 from Steven Munroe  ---
Created attachment 50264
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=50264&action=edit
Compile test for simplified test case

Download vec_dummy.c and vec_int128_ppc.h into a local directory and compile

gcc -O3 -mcpu=power10 -m64 -c vec_dummy.c

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-04-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #5 from Steven Munroe  ---
Any progress on this?

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2021-06-10 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #13 from Steven Munroe  ---
"We want to use plain TImode instead of V1TImode on newer cpus."

Actually I disagree. We have vector __int128 in the ABI, and with POWER10 a
complete set of arithmetic operations for 128-bit in VRs.

Also this issue is not restricted to TImode. It also affects _Float128
(KFmode), _ibm128 (TFmode) and Libmvec for vector float/double. The proper and
optimum handling of these "union transfers" has been broken in GCC for years.

And I have grave reservations about the vague plans of a small/fringe minority
to subset the PowerISA for their convenience.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #16 from Steven Munroe  ---
Created attachment 52510
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52510&action=edit
Reduced tests for xfers from __float128 to vector or __int128

Cover more types including __int128 and vector __int128

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

Steven Munroe  changed:

   What|Removed |Added

 Status|RESOLVED|REOPENED
 Resolution|FIXED   |---

--- Comment #17 from Steven Munroe  ---
I don't think this is fixed.

The fix was supposed to be back-ported to GCC11 for Advance Toolchain 15.

The updated test case shows that this is clearly not working as advertised.

Either the GCC 12 fix has regressed due to subsequent updates or the AT15 GCC 11
back-port fails due to some missing/different code between GCC 11/12.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-25 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #21 from Steven Munroe  ---
Yes, I was told by Peter Bergner that the fix from
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085#c15 had been back-ported
to AT15.0-1.

But when I ran this test with AT15.0-1 I saw:

 :
   0:   20 00 20 39 li  r9,32
   4:   d0 ff 41 39 addi    r10,r1,-48
   8:   57 12 42 f0 xxswapd vs34,vs34
   c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  10:   ce 48 4a 7c lvx v2,r10,r9
  14:   20 00 80 4e blr

0030 :
  30:   20 00 20 39 li  r9,32
  34:   d0 ff 41 39 addi    r10,r1,-48
  38:   57 12 42 f0 xxswapd vs34,vs34
  3c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  40:   ce 48 4a 7c lvx v2,r10,r9
  44:   20 00 80 4e blr

0060 :
  60:   20 00 20 39 li  r9,32
  64:   d0 ff 41 39 addi    r10,r1,-48
  68:   57 12 42 f0 xxswapd vs34,vs34
  6c:   99 4f 4a 7c stxvd2x vs34,r10,r9
  70:   99 4e 4a 7c lxvd2x  vs34,r10,r9
  74:   57 12 42 f0 xxswapd vs34,vs34
  78:   20 00 80 4e blr

0090 :
  90:   57 12 42 f0 xxswapd vs34,vs34
  94:   20 00 40 39 li  r10,32
  98:   d0 ff 01 39 addi    r8,r1,-48
  9c:   f0 ff 21 39 addi    r9,r1,-16
  a0:   99 57 48 7c stxvd2x vs34,r8,r10
  a4:   00 00 69 e8 ld  r3,0(r9)
  a8:   08 00 89 e8 ld  r4,8(r9)
  ac:   20 00 80 4e blr

So either the patch for AT15.0-1 is not applied correctly or is non-functional
because of some difference between GCC11/GCC12. Or regressed because of some
other change/patch.

In my experience this part of GCC is fragile (based on the long/sad history of
IBM long double). So this needs to be monitored with each new update.

[Bug target/100085] Bad code for union transfer from __float128 to vector types

2022-02-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100085

--- Comment #23 from Steven Munroe  ---
Ok, but I strongly recommend a compiler test that verifies that the compiler is
generating the expected code (for this and other cases).

We have a history of common code changes (accidental or deliberate) causing
regressions for POWER targets.

Best to find these early, before they impact customer performance.
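
For example, a sketch of the kind of testsuite check I have in mind (the
DejaGnu directives and option spellings here are only illustrative):

/* { dg-do compile { target lp64 } } */
/* { dg-options "-O3 -mdejagnu-cpu=power8 -mfloat128" } */

typedef union
{
  __float128            vf1;
  __vector unsigned int vx4;
} __VF_128;

__vector unsigned int
xfer_kf_to_vui (__float128 f128)
{
  __VF_128 vunion;
  vunion.vf1 = f128;
  return vunion.vx4;
}

/* The transfer should stay in vector registers; no store/reload.  */
/* { dg-final { scan-assembler-not {\mstxvd2x\M} } } */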

[Bug c/106755] New: Incorrect code gen for altivec intrinsics with constant inputs

2022-08-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106755

Bug ID: 106755
   Summary: Incorrect code gen for altivec intrinsics with
constant inputs
   Product: gcc
   Version: 12.2.1
Status: UNCONFIRMED
  Severity: blocker
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 53514
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=53514&action=edit
Reduced test case for vec_muludq() with make

Compiling the PVECLIB project V1.0.4-4 fails its unit tests (make check) when
compiled with GCC 12 on Fedora 36/37.

Two unit tests fail:
Vector Multiply Unsigned Double Quadword, vec_muludq()
and
Vector Multiply-Add Unsigned Quadword, vec_madduq()

The tests that fail are passing local vector constants to in-lined instances of
these functions.

Current status: the PVECLIB package is blocked for Fedora 37 because it will
not compile with the default GCC 12 compiler.

[Bug target/104124] New: Poor optimization for vector splat DW with small consts

2022-01-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Bug ID: 104124
   Summary: Poor optimization for vector splat DW with small
consts
   Product: gcc
   Version: 11.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

It looks to me like the compiler is seeing register pressure caused by loading
all the vector long long constants I need in my code. This is leaf code of a
size that can run out of volatiles (no stack frame). But this puts more
pressure on volatile VRs, VSRs, and GPRs, especially GPRs, because it is loading
from .rodata when it could (and should) use a vector immediate.

For example:

vui64_t
__test_splatudi_0_V0 (void)
{
  return vec_splats ((unsigned long long) 0);
}

vi64_t
__test_splatudi_1_V0 (void)
{
  return vec_splats ((signed long long) -1);
}

Generate:
01a0 <__test_splatudi_0_V0>:
 1a0:   8c 03 40 10 vspltisw v2,0
 1a4:   20 00 80 4e blr

01c0 <__test_splatudi_1_V0>:
 1c0:   8c 03 5f 10 vspltisw v2,-1
 1c4:   20 00 80 4e blr
...

But other cases that could use immediates like:

vui64_t
__test_splatudi_12_V0 (void)
{
  return vec_splats ((unsigned long long) 12);
}

GCC 9/10/11 Generates for power8:

0170 <__test_splatudi_12_V0>:
 170:   00 00 4c 3c addis   r2,r12,0
170: R_PPC64_REL16_HA   .TOC.
 174:   00 00 42 38 addi    r2,r2,0
174: R_PPC64_REL16_LO   .TOC.+0x4
 178:   00 00 22 3d addis   r9,r2,0
178: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 17c:   00 00 29 39 addi    r9,r9,0
17c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 180:   ce 48 40 7c lvx v2,0,r9
 184:   20 00 80 4e blr

and for Power9:
 <__test_splatisd_12_PWR9>:
   0:   d1 62 40 f0 xxspltib vs34,12
   4:   02 16 58 10 vextsb2d v2,v2
   8:   20 00 80 4e blr

So why can't the power8 target generate:

00f0 <__test_splatudi_12_V1>:
  f0:   8c 03 4c 10 vspltisw v2,12
  f4:   4e 16 40 10 vupkhsw v2,v2
  f8:   20 00 80 4e blr

This is 4 cycles vs 9 (best case, and it is always 9 cycles because GCC does
not exploit immediate fusion).
In fact GCC 8 (AT12) does this.

So I tried defining my own vec_splatudi:

vi64_t
__test_splatudi_12_V1 (void)
{
  vi32_t vwi = vec_splat_s32 (12);
  return vec_unpackl (vwi);
}

Which generates the <__test_splatudi_12_V1> sequence above for GCC 8. But for
GCC 9/10/11 it generates:

0110 <__test_splatudi_12_V1>:
 110:   00 00 4c 3c addis   r2,r12,0
110: R_PPC64_REL16_HA   .TOC.
 114:   00 00 42 38 addi    r2,r2,0
114: R_PPC64_REL16_LO   .TOC.+0x4
 118:   00 00 22 3d addis   r9,r2,0
118: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 11c:   00 00 29 39 addi    r9,r9,0
11c: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 120:   ce 48 40 7c lvx v2,0,r9
 124:   20 00 80 4e blr

Again! GCC has gone out of its way to be this clever! Badly! While it can be
appropriately clever for power9!

I have tried many permutations of this and the only way I have found to prevent
this (GCC 9/10/11) cleverness is to use inline __asm (which has other bad side
effects).

[Bug target/104124] Poor optimization for vector splat DW with small consts

2022-01-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe  changed:

   What|Removed |Added

 CC||munroesj at gcc dot gnu.org

--- Comment #1 from Steven Munroe  ---
Created attachment 52236
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52236&action=edit
Attempts to load small int consts to vector DW via splat

Multiple attempts to convince GCC to load small integer (-16 to 15) constants via
splat. Current GCC versions (9/10/11) convert vec_splats() and
explicit vec_splat_s32/vec_unpackl sequences into loads from .rodata. This
generates more instructions, takes more cycles, and causes register pressure
that results in unnecessary spill/reload and load-hit-store rejects.

[Bug target/104124] Poor optimization for vector splat DW with small consts

2022-01-27 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

Steven Munroe  changed:

   What|Removed |Added

  Attachment #52236|0   |1
is obsolete||

--- Comment #2 from Steven Munroe  ---
Created attachment 52307
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=52307&action=edit
Enhanced test case that also shows CSE failure

Original test case plus an example where CSE should common a splat immediate
or even a .rodata load, but fails to do even that.

[Bug c/110795] New: Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

Bug ID: 110795
   Summary: Bad code gen for vector compare booleans
   Product: gcc
   Version: 13.1.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 55626
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55626&action=edit
Test examples for vector code combining vector compare with logical OR.

Combining a vec_cmplt and vec_cmpge with vector logical OR miscompiles.
For example:
  // Capture the carry t as a bool using signed compare
  t = vec_cmplt ((vi32_t) x, zeros);
  ge = vec_cmpge (x, z);
  // Combine t with (x >= z) for 33-bit compare
  t  = vec_or (ge, t);

This seems to work for the minimized example above but fails when used in the
more complex loop of the example vec_divduw_V1. At -O3 the compiler elides any
code generated for vec_cmplt.

With this bug the function vec_divduw_V1 (Vector Divide Double Unsigned Word)
fails the unit test.
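
A self-contained version of the fragment above (typedefs assumed from PVECLIB;
the attachments have the full context):

#include <altivec.h>

typedef __vector unsigned int vui32_t;
typedef __vector signed int   vi32_t;
typedef __vector __bool int   vb32_t;

vb32_t
test_cmpgt33 (vui32_t x, vui32_t z)
{
  const vi32_t zeros = { 0, 0, 0, 0 };
  vb32_t t, ge;

  // Capture the carry t as a bool using signed compare
  t  = vec_cmplt ((vi32_t) x, zeros);
  ge = vec_cmpge (x, z);
  // Combine t with (x >= z) for 33-bit compare
  t  = vec_or (ge, t);
  return t;
}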

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #1 from Steven Munroe  ---
Created attachment 55627
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=55627&action=edit
Main and unit test. When compiled and linked with vec_divide.c, it will verify
whether the divide code is correct.

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-24 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #2 from Steven Munroe  ---
Also fails with gcc11/12. Also fails with Advance Toolchain 10.0 GCC 6.4.1.

It might fail for all versions between GCC 6 and 13.

[Bug target/110795] Bad code gen for vector compare booleans

2023-07-28 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110795

--- Comment #5 from Steven Munroe  ---
Thanks, sorry I missed the obvious.

[Bug target/111645] New: Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

Bug ID: 111645
   Summary: Intrinsics vec_sldb /vec_srdb fail with __vector
unsigned __int128
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 56018
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56018&action=edit
example of the problem. Compile with  gcc -m64 -O3 -mcpu=power10 -c sldbi.c

GCC 12 and 13 fail to compile the vector intrinsics vec_sldb / vec_srdb as
required by the Power Vector Intrinsic Programming Reference.

error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_sldb’

Both the Programming Reference and the GCC documentation state that vector
(unsigned/signed) __int128 are valid operands. But they fail with:

error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_sldb’
or
error: invalid parameter combination for AltiVec intrinsic ‘__builtin_vec_srdb’
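
A minimal sketch of the rejected call (the attachment has the full example;
compile with -mcpu=power10):

#include <altivec.h>

typedef __vector unsigned __int128 vui128_t;

vui128_t
test_sldb_int128 (vui128_t vra, vui128_t vrb)
{
  /* Per the Intrinsic Reference this should be accepted, but GCC 12/13
     reject it with the error above.  */
  return vec_sldb (vra, vrb, 4);
}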

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

Steven Munroe  changed:

   What|Removed |Added

  Attachment #56018|0   |1
is obsolete||

--- Comment #2 from Steven Munroe  ---
Created attachment 56019
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=56019&action=edit
Updated test case with static inline functions

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-09-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #3 from Steven Munroe  ---
(In reply to Peter Bergner from comment #1)
> I see that we have created built-in overloads for signed and unsigned vector
> char through vector long long.  That said, the rs6000-builtins.def only
> seems to support the signed vector types though, which is why you're seeing
> an error.  So confirmed.
> 
> That said, I believe your 3rd argument needs to be a real constant integer,
> since the vsldbi instruction requires that.  It doesn't allow for a const
> int variable.  I notice some older (not trunk) gcc versions are ICEing with
> that, so another bug to look at.
The original code is static inline, so the const int parm should transfer
intact to the builtin const.

It seems I over-simplified the reduced test case.
> 
> I do not see any documentation that says we support the vector __int128
> type.  Where exactly did you see that?  However, from the instruction
> description, it seems like the hw instruction could support that.

I stand corrected. The documentation only describes vector unsigned long long.
But the instruction is like vsldoi and does not really care what the type is.

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-10-01 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #4 from Steven Munroe  ---
Actually, the shift/rotate intrinsics vec_rl, vec_rlmi, vec_rlnm, vec_sl,
vec_sr, and vec_sra support vector __int128, as required for the PowerISA 3.1
vector shift/rotate quadword instructions.

But vec_sld, vec_sldb, vec_sldw, vec_sll, vec_slo, vec_srdb, vec_srl, and
vec_sro do not.

There is no obvious reason for this inconsistency, as the target instructions
are effectively 128/256-bit operations returning a 128-bit result. The type of
the inputs is incidental to the operation.

Any restrictions imposed by the original Altivec.h PIM were broken long ago by
VSX and PowerISA 2.07.

Net: the Power Vector Intrinsic Programming Reference and the compilers should
support the vector __int128 type for any instruction where it makes sense as an
input or result.

[Bug target/111645] Intrinsics vec_sldb /vec_srdb fail with __vector unsigned __int128

2023-10-25 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111645

--- Comment #6 from Steven Munroe  ---
(In reply to Carl Love from comment #5)
> There are a couple of issues with the test case in the attachment.  For
> example one of the tests is:
> 
> 
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>  return vec_sldb (vra, vrb, shb);
> }
> 
> When I tried to compile it, it seemed to compile.  However if I take off the
> static inline, then I get an error about in compatible arguments.  The
> built-in requires an explicit integer be based in the third argument.  The
> following worked for me:
> 
> 
> static inline vui64_t
> vec_vsldbi_64 (vui64_t vra, vui64_t vrb, const unsigned int shb)
> {
>  return vec_sldb (vra, vrb, 1);
> }
> 
> The compiler/assembler needs an explicit value for the third argument as it
> has to generate the instruction with the immediate shift value as part of
> the instruction.  Hence a variable for the third argument will not work.
> 
> Agreed that the __int128 arguments can and should be supported.  Patch to
> add that support is in progress but will require getting the LLVM/OpenXL
> team to agree to adding the __128int variants as well.

Yes, I know. In the PVECLIB case these functions will always be static inline,
so this is not an issue for me.

[Bug target/104124] Poor optimization for vector splat DW with small consts

2023-06-28 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104124

--- Comment #5 from Steven Munroe  ---
Thanks

[Bug c/116004] New: PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

Bug ID: 116004
   Summary: PPC64 vector Intrinsic vec_first_mismatch_or_eos_index
generates poor code
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

GCC 13 generates the following code for the intrinsic
vec_first_mismatch_or_eos_index with -mcpu=power9 -O3:

00c0 :
  c0:   d1 02 00 f0 xxspltib vs32,0
  c4:   07 00 22 10 vcmpneb v1,v2,v0
  c8:   07 00 03 10 vcmpneb v0,v3,v0
  cc:   07 19 42 10 vcmpnezb v2,v2,v3
  d0:   17 04 21 f0 xxland  vs33,vs33,vs32
  d4:   57 0d 42 f0 xxlorc  vs34,vs34,vs33
  d8:   02 16 61 10 vctzlsbb r3,v2
  dc:   b4 07 63 7c extsw   r3,r3
  e0:   20 00 80 4e blr

The use of vcmpneb to compare for EOS is redundant with the vcmpnezb instruction
(which includes the EOS compares). The additional xxland/xxlorc logic is only
necessary because of the extra vcmpneb compares.

All you need is a single vcmpnezb as it already handles the a/b mismatch and
EOS tests for both operands. For example:

0070 :
  70:   07 19 42 10 vcmpnezb v2,v2,v3
  74:   02 16 61 10 vctzlsbb r3,v2
  78:   b4 07 63 7c extsw   r3,r3
  7c:   20 00 80 4e blr

[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #1 from Steven Munroe  ---
Compile test code examples:

int
test_intrn_first_mismatch_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  return vec_first_mismatch_or_eos_index (vra, vrb);
}

int
test_first_mismatch_byte_or_eos_index_PWR9 (vui8_t vra, vui8_t vrb)
{
  vui8_t abnez;
  int result;

  abnez  = vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);
  return result;
}

[Bug target/116004] PPC64 vector Intrinsic vec_first_mismatch_or_eos_index generates poor code

2024-07-19 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116004

--- Comment #2 from Steven Munroe  ---
Actually:

  abnez  = (vui8_t) vec_cmpnez (vra, vrb);
  result = vec_cntlz_lsbb (abnez);

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-11 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #4 from Steven Munroe  ---
Created attachment 59323
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59323&action=edit
Examples for Vector DW int constant

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-08 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #3 from Steven Munroe  ---
I tested the attached example source on GCC 14.0.1 from Ubuntu on powerpc64le.

Seeing the same results. So add GCC 14.0.1 to this list. Actually the last GCC
version that did not have this bug was GCC 7. Looks like GCC 8-14 all do this.

I don't have the time or stamina to build GCC from source head right now.

But anyone can try using the same attached source.

gcc -O3 -mcpu=power8 -c vec-shift32-const.c

Then objdump and look for any lvx instructions. There should be none.

[Bug target/117007] New: Poor optimization for small vector constants needed for vector shift/rotate/mask generation.

2024-10-07 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Bug ID: 117007
   Summary: Poor optimization for small vector constants needed for
vector shift/rotate/mask generation.
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59291
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59291&action=edit
compile with -m64 -O3 -mcpu=power8 or power9

For vector library code there is a frequent need to "splat" small integer
constants for vector shifts, rotates, and mask generation. The
instructions exist (e.g. vspltisw, xxspltib, xxspltiw) and are supported by
intrinsics.

But when these are used to provide constants in VRs to other vector operations,
the compiler goes out of its way to convert them to vector loads from .rodata.

This is especially bad for power8/9, as .rodata loads require 32-bit offsets and
always generate 3/4 instructions with a best-case (L1 cache hit) latency of 9
cycles. The original splat-immediate / shift implementation runs 2-4
instructions (with a good chance for CSE) and 4-6 cycles latency.

For example:

vui32_t
mask_sig_v2 ()
{
  vui32_t ones = vec_splat_u32(-1);
  vui32_t shft = vec_splat_u32(9);
  return vec_vsrw (ones, shft);
}

With GCC V6 generates:

01c0 :
 1c0:   8c 03 09 10 vspltisw v0,9
 1c4:   8c 03 5f 10 vspltisw v2,-1
 1c8:   84 02 42 10 vsrw    v2,v2,v0
 1cc:   20 00 80 4e blr


While with GCC 13.2.1 generates:

01c0 :
 1c0:   00 00 4c 3c addis   r2,r12,0
1c0: R_PPC64_REL16_HA   .TOC.
 1c4:   00 00 42 38 addi    r2,r2,0
1c4: R_PPC64_REL16_LO   .TOC.+0x4
 1c8:   00 00 22 3d addis   r9,r2,0
1c8: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 1cc:   00 00 29 39 addi    r9,r9,0
1cc: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 1d0:   ce 48 40 7c lvx v2,0,r9
 1d4:   20 00 80 4e blr

This is the same for -mcpu=power8/power9.

It gets worse for vector functions that require multiple shift/mask constants.

For example:

// Extract the float sig
vui32_t
test_extsig_v2 (vf32_t vrb)
{
  const vui32_t zero = vec_splat_u32(0);
  const vui32_t sigmask = mask_sig_v2 ();
  const vui32_t expmask = mask_exp_v2 ();
#if 1
  vui32_t ones = vec_splat_u32(-1);
  const vui32_t hidden = vec_sub (sigmask, ones);
#else
  const vui32_t hidden = mask_hidden_v2 ();
#endif
  vui32_t exp, sig, normal;

  exp = vec_and ((vui32_t) vrb, expmask);
  normal = vec_nor ((vui32_t) vec_cmpeq (exp, expmask),
(vui32_t) vec_cmpeq (exp, zero));
  sig = vec_and ((vui32_t) vrb, sigmask);
  // If normal merger hidden-bit the sig-bits
  return (vui32_t) vec_sel (sig, normal, hidden);
}

GCC V6 generated:
0310 :
 310:   8c 03 bf 11 vspltisw v13,-1
 314:   8c 03 37 10 vspltisw v1,-9
 318:   8c 03 60 11 vspltisw v11,0
 31c:   06 0a 0d 10 vcmpgtub v0,v13,v1
 320:   84 09 00 10 vslw    v0,v0,v1
 324:   8c 03 29 10 vspltisw v1,9
 328:   17 14 80 f1 xxland  vs44,vs32,vs34
 32c:   84 0a 2d 10 vsrw    v1,v13,v1
 330:   86 00 0c 10 vcmpequw v0,v12,v0
 334:   86 58 8c 11 vcmpequw v12,v12,v11
 338:   80 6c a1 11 vsubuwm v13,v1,v13
 33c:   17 14 41 f0 xxland  vs34,vs33,vs34
 340:   17 65 00 f0 xxlnor  vs32,vs32,vs44
 344:   7f 03 42 f0 xxsel   vs34,vs34,vs32,vs45
 348:   20 00 80 4e blr

While GCC 13.2.1 -mcpu=power8 generates:
360 :
 360:   00 00 4c 3c addis   r2,r12,0
360: R_PPC64_REL16_HA   .TOC.
 364:   00 00 42 38 addi    r2,r2,0
364: R_PPC64_REL16_LO   .TOC.+0x4
 368:   00 00 02 3d addis   r8,r2,0
368: R_PPC64_TOC16_HA   .rodata.cst16+0x30
 36c:   00 00 42 3d addis   r10,r2,0
36c: R_PPC64_TOC16_HA   .rodata.cst16+0x20
 370:   8c 03 a0 11 vspltisw v13,0
 374:   00 00 08 39 addi    r8,r8,0
374: R_PPC64_TOC16_LO   .rodata.cst16+0x30
 378:   00 00 4a 39 addi    r10,r10,0
378: R_PPC64_TOC16_LO   .rodata.cst16+0x20
 37c:   00 00 22 3d addis   r9,r2,0
37c: R_PPC64_TOC16_HA   .rodata.cst16+0x40
 380:   e4 06 4a 79 rldicr  r10,r10,0,59
 384:   ce 40 20 7c lvx v1,0,r8
 388:   00 00 29 39 addi    r9,r9,0
388: R_PPC64_TOC16_LO   .rodata.cst16+0x40
 38c:   8c 03 17 10 vspltisw v0,-9
 390:   98 56 00 7c lxvd2x  vs0,0,r10
 394:   e4 06 29 79 rldicr  r9,r9,0,59
 398:   98 4e 80 7d lxvd2x  vs12,0,r9
 39c:   84 01 21 10 vslw    v1,v1,v0
 3a0:   50 02 00 f0 xxswapd vs0,vs0
 3a4:   17 14 01 f0 xxland  vs32,vs33,vs34
 3a8:   50 62 8c f1 xxswapd vs12,vs12
 

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #10 from Steven Munroe  ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.  Yes.

More examples:

vui64_t
test_sld_52_v1 (vui64_t vra)
{
  vui32_t shft = vec_splat_u32(52-64);
  return vec_vsld (vra, (vui64_t) shft);
}

vui64_t
test_sld_52_v0 (vui64_t vra)
{
  return vra << 52;
}

The PowerISA is challenged to generate a vector doubleword constant, so it
seems easier to load such constants from .rodata. Again, a load from .rodata is
a minimum of 3 instructions and a latency of 9 cycles (L1 cache hit).

But there are many examples of vector doubleword operations that need small
constants. Also, the doubleword shift/rotate operations only require a 6-bit
shift count. Here changing the vector shift intrinsics to accept vector unsigned
char for the shift count would be helpful.

It is often faster to generate these constants from existing splat immediate
instructions and 1-2 other operations than to pay the full latency cost of a
(.rodata) vector load.

For power8 the current GCC compilers will take this option away from the library
developer. For example:

gcc-13 -O3 -mcpu=power8 -mtune=power8
01e0 :  #TL 11/11
 1e0:   00 00 4c 3c addis   r2,r12,.TOC.@ha
 1e4:   00 00 42 38 addi    r2,r2,.TOC.@l
 1e8:   00 00 22 3d addis   r9,r2,.rodata.cst16@ha  #L 2/2
 1ec:   00 00 29 39 addi    r9,r9,.rodata.cst16@l   #L 2/2
 1f0:   ce 48 00 7c lvx v0,0,r9 #L 5/5
 1f4:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 1f8:   20 00 80 4e blr

01b0 :  #TL 11/11
 1e0:   00 00 4c 3c addis   r2,r12,.TOC.@ha
 1e4:   00 00 42 38 addi    r2,r2,.TOC.@l
 1e8:   00 00 22 3d addis   r9,r2,.rodata.cst16@ha  #L 2/2
 1ec:   00 00 29 39 addi    r9,r9,.rodata.cst16@l   #L 2/2
 1c0:   ce 48 00 7c lvx v0,0,r9 #L 5/5
 1c4:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 1c8:   20 00 80 4e blr

While the original PowerPC64LE support compilers would allow the library
developer to use intrinsics to generate smaller/faster sequences. Again, the
PowerISA vector shift/rotate doubleword operations only need the low-order
6 bits for the shift count. Here the original Altivec vec_splat_u32() can
generate shift counts for the ranges 0-15 and 48-63 easily. Or, if the vector
shift/rotate intrinsics would accept vector unsigned char for the shift count,
the library developer could use vec_splat_u8().

gcc-6 -O3 -mcpu=power8 -mtune=power8
0170 :  #TL 4/4
 170:   8c 03 14 10 vspltisw v0,-12 #L 2/2
 174:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 178:   20 00 80 4e blr

POWER9 has the advantage of VSX Vector Splat Immediate Byte and will use it for
the inline vector code. But this will always insert the extend signed byte to
doubleword. The current Power Intrinsic Reference does not provide a direct
mechanism to generate xxspltib. If vec_splat_u32() is used, the current compiler
(constant propagation?) will convert this into a load vector (lxv this time)
from .rodata. This is still 3 instructions and 9 cycles.

gcc-13 -O3 -mcpu=power9 -mtune=power9
01a0 :  #TL 7/7
 1a0:   d1 a2 01 f0 xxspltib vs32,52#L 3/3
 1a4:   02 06 18 10 vextsb2d v0,v0  #L 2/2
 1a8:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 1ac:   20 00 80 4e blr

 0170 : #TL 11/11
 1e0:   00 00 4c 3c addis   r2,r12,.TOC.@ha
 1e4:   00 00 42 38 addi    r2,r2,.TOC.@l
 1e8:   00 00 22 3d addis   r9,r2,.rodata.cst16@ha  #L 2/2
 1ec:   00 00 29 39 addi    r9,r9,.rodata.cst16@l   #L 2/2
 180:   09 00 09 f4 lxv vs32,0(r9)  #L 5/5
 184:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 188:   20 00 80 4e blr

This is still larger and slower than if the compiler/intrinsics would allow the
direct use of xxspltib to generate the shift count for vsld.

gcc-fix -O3 -mcpu=power9 -mtune=power9
0170 :  #TL 5/5
 170:   d1 a2 01 f0 xxspltib vs32,52#L 3/3
 174:   c4 05 42 10 vsld    v2,v2,v0    #L 2/2
 178:   20 00 80 4e blr

Power10 also generates VSX Vector Splat Immediate Byte and extend sign for the
vector inline doubleword shift. But it again converts the vec_splat_u32()
intrinsic into a load vector (plxv this time) from .rodata. This is smaller and
faster than the power9 sequence but seems a bit of overkill for the small
constant (52) involved.

gcc-13 -O3 -mcpu=power10 -mtune=power10
01d0 :  #TL 7/11
 1d0:   d1 a2 01 f0 xxspltib vs32,52#L 3/4
 1d4:   02 06 18 10 vextsb2d v0,v0  #L 3/4
 1d8:   c4 05 42 10 vsld    v2,v2,v0    #L 1/3
 1dc:   20 00 80 4e blr

01b0 :   

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-29 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #9 from Steven Munroe  ---
(In reply to Segher Boessenkool from comment #7)
> It is always more and slower code.  Yes.

Let's try some specific examples and examine the code generated for power8/9/10.

vui32_t
test_slw_23_v0 (vui32_t vra)
{
  return vra << 23;
}

vui32_t
test_slw_23_v1 (__vector unsigned int vra)
{
  vui32_t shft = vec_splat_u32(23-32);
  return vec_sl (vra, shft);
}

gcc-13 -O3 -mcpu=power8 -mtune=power8
0100 :  #TL 11/11
 100:   00 00 4c 3c addis   r2,r12,.TOC.@ha
 104:   00 00 42 38 addi    r2,r2,.TOC.@l
 108:   00 00 22 3d addis   r9,r2,.rodata.cst16@ha  #L 2/2
 10c:   00 00 29 39 addi    r9,r9,.rodata.cst16@l   #L 2/2
 110:   ce 48 00 7c lvx v0,0,r9 #L 5/5
 114:   84 01 42 10 vslw    v2,v2,v0    #L 2/2
 118:   20 00 80 4e blr

00e0 :  #TL 4/4
  e0:   8c 03 17 10 vspltisw v0,-9  #L 2/2
  e4:   84 01 42 10 vslw    v2,v2,v0    #L 2/2
  e8:   20 00 80 4e blr

For the inline vector code, GCC tends to generate a load from .rodata. The
addis/addi/lvx (3-instruction) sequence is always generated for the medium
memory model. Only the linker will know the final offset, so there is no
optimization. This is a dependent sequence with a best-case (L1 cache hit)
latency of 11 cycles.

Using the vector unsigned int type and the intrinsic vec_splat_u32()/vec_sl()
sequence generates two instructions (vspltisw/vslw) for this simple case.
Again a dependent sequence, for 4 cycles total. 4 cycles beats 11.

gcc-13 -O3 -mcpu=power9 -mtune=power9
0100 :  #TL 7/7
 100:   d1 ba 00 f0 xxspltib vs32,23#L 3/3
 104:   02 06 10 10 vextsb2w v0,v0  #L 2/2
 108:   84 01 42 10 vslw    v2,v2,v0    #L 2/2
 10c:   20 00 80 4e blr

 00e0 : #TL 5/5
  e0:   8c 03 17 10 vspltisw v0,-9  #L 3/3
  e4:   84 01 42 10 vslw    v2,v2,v0    #L 2/2
  e8:   20 00 80 4e blr

POWER9 has the advantage of VSX Vector Splat Immediate Byte and will use it for
the inline vector code. The disadvantage is that it is a byte splat for a word
shift, so the compiler inserts the (pedantic) extend byte to word. This adds 1
instruction and 2 cycles of latency to the sequence.

The ISA for vector shift word only requires the low order 5-bits of each
element for the shift count. So the extend is not required and either vspltisw
or xxspltib will work here. This is an example where changing the vector shift
intrinsic to accept vector unsigned char for the shift count would be helpful.

Again the intrinsic implementation beats the compiler vector inline code by
2 cycles (5 vs 7 cycles) and one fewer instruction.

gcc-13 -O3 -mcpu=power10 -mtune=power10
0100 :  #TL 4/7
 100:   00 00 00 05 xxspltiw vs32,23#L 3/4
 104:   17 00 07 80 
 108:   84 01 42 10 vslw    v2,v2,v0    #L 1/3
 10c:   20 00 80 4e blr

00e0 :  #TL 4/7
  e0:   8c 03 17 10 vspltisw v0,-9  #L 3/4
  e4:   84 01 42 10 vslw    v2,v2,v0    #L 1/3
  e8:   20 00 80 4e blr

Power10 has the advantage of the VSX Vector Splat Immediate Word instruction.
This is an 8-byte prefixed instruction and is overkill for a 5-bit shift count.

The good news is the cycle latency is the same, but it adds another word to the
code stream which is not required to generate such a small (5-bit) constant.

However VSX Vector Splat Immediate Word will be excellent for generating
mb/me/sh masks for Vector Rotate Left Word then Mask Insert and the like.

So I will concede that for the shift/rotate word immediate case for power10 the
latencies are comparable.

The problem I see is: as the examples get more complex (generating masks for
float) or involve double/quadword shifts, the compiler (CSE or constant
propagation) will convert the splat immediate into a vector load from .rodata.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #6 from Steven Munroe  ---
I am starting to see a pattern and wonder if the compiler is confused by
assuming the shift count must match the width/type of the shift/rotate target.

This is implied all the way back to the Altivec PIM and the current Intrinsic
Reference and the GCC documentation. The intrinsics vec_rl(), vec_sl(),
vec_sr(), vec_sra() all require that the shift count be the same (unsigned)
type (element size) as the shifted/rotated value.

This might confuse the compiler into thinking it MUST properly (zero/sign)
extend any shift count. But that is wrong.

But the PowerISA only requires the shift count in the low-order (3-7) bits of
each element. Any high-order element bits are don't-cares.

So the shift-count (operand b) could easily be a vector unsigned char (byte
elements).
In fact the vec_sll(), vec_slo(), vec_srl(), and vec_sro() allow this.

So the compiler can correctly use vspltisb, vspltish, vspltisw, or xxspltib for
any vector shift/rotate where the shift count is a compile-time constant.
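
For example, a sketch of the pattern this would allow (types from PVECLIB; the
cast of the byte splat to the word shift-count type is valid precisely because
only the low 5 bits of each word element are read):

vui32_t
test_slw_12_bytesplat (vui32_t vra)
{
  /* Each byte is 12, so each word element is 0x0c0c0c0c; the low
     5 bits of every word element are still 12.  */
  vui8_t shb = vec_splat_u8 (12);
  return vec_sl (vra, (vui32_t) shb);
}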

This is always less and faster code than loading vector constants from .rodata.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-10-26 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Steven Munroe  changed:

   What|Removed |Added

  Attachment #59323|0   |1
is obsolete||

--- Comment #5 from Steven Munroe  ---
Created attachment 59446
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59446&action=edit
Examples of DW/QW shift immediate

Also found the compiler mishandling the quadword shift by a constant for inline
vector code. I think this is related to the fact that GCC does not actually
support quadword integer constants.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-11-07 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #12 from Steven Munroe  ---
It seems like even for small values of signed char, vec_splats ((signed char) x)
will sometimes generate 2 instructions where it should only generate a single
xxspltib.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-11-07 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #13 from Steven Munroe  ---
It seems like even for small values of signed char, vec_splats ((signed char) x)
for target -mcpu=power9 will sometimes generate 2 instructions where it should
only generate a single xxspltib.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-11-07 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #11 from Steven Munroe  ---
Created attachment 59560
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59560&action=edit
Test cases for vec_splats(signed char) on -mcpu=power9

For any valid char value I would expect, for example, vec_splats ((signed char) 16)
to generate a single xxspltib vs34,16.

But I am seeing:
0020 :
  20:   d1 42 40 f0 xxspltib vs34,8
  24:   00 10 42 10 vaddubm v2,v2,v2

[Bug target/117818] vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-27 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #2 from Steven Munroe  ---
Same issues compiled for power9/10

[Bug target/117818] New: vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-27 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

Bug ID: 117818
   Summary: vec_add incorrectly generates vadduwm for vector char
const inputs.
   Product: gcc
   Version: 13.3.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

Created attachment 59731
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59731&action=edit
Test case Vector shift long with const shift count -mcpu=power8

More specifically when the input to vec_add is generated by vec_splat_u8() and
the result of the vec_add is passed into the shift count of some combination of
vec_slo()/vec_sll() the compiler may generate vadduwm instead of vaddubm. This
is very odd given that both vec_slo()/vec_sll() require a vector (un)signed
char for the shift count (VRB).

The combination vec_slo()/vec_sll() supports a 0-127 bit quadword left shift,
while vec_sro()/vec_srl() supports a 0-127 bit quadword right shift. This
requires a 7-bit shift count. Intrinsics vec_slo()/vec_sro() requires the high
4-bits of the shift count in VR bits 121:124. Intrinsics vec_sll()/vec_srl()
requires the low 3-bits of the shift count in VR bits 125:127. However the
PowerISA instructions vsl/vsr (corresponding to vec_sll()/vec_srl()) require
that the shift count be splatted across all 16 bytes of the VRB.

From the PowerISA descriptions of vsl/vsr:

The result is placed into VRT, except if, for any byte
element in register VRB, the low-order 3 bits are not
equal to the shift amount, then VRT is undefined.

So it makes sense that intrinsics vec_slo()/vec_sll()/vec_sro()/vec_srl()
require type vector char for VRB (shift count).
It also makes sense that const shift counts would be generated by
vec_splat_u8() and possibly some other intrinsic operations to extend the 5-bit
SIM range (-16 to +15) to cover the 7-bit quadword shift counts required.

Note: Loading a const vector from .rodata is expensive compared to short
sequences of vector splat immediate and 0 to 2 additional operations.

For example : 

vui8_t test_splat6_char_18 ()
{
  vui8_t tmp = vec_splat_u8(9);
  return vec_add (tmp, tmp);
}

You expect the compiler to generate:
vspltisb 2,9
vaddubm 2,2,2

And it does.

The quadword shift left by 18 example:

vui128_t test_slqi_char_18_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(9);
  // 9 << 1 == 18
  tmp = vec_add (tmp, tmp);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_sll (result, tmp);
}

Actually generates:

vspltisb 0,9
vadduwm 0,0,0
vslo 2,2,0
vsl 2,2,0

Note that the expected Vector Add Unsigned Byte Modulo (vaddubm) instruction
has been replaced by Vector Add Unsigned Word Modulo (vadduwm).

Technically the results are not incorrect for this specific value (no carries
into adjacent bytes). But this fails for any of the negative values (-16 to -1).
Negative SIM values are required to cover the high range of the 7-bit shift
count.

For example a shift count of 110. 

vui8_t
test_splat6_char_110 ()
{ // 110-128 = -18
  vui8_t tmp = vec_splat_u8(-9);
  return vec_add (tmp, tmp);
}

And the compiler does generate the expected:
vspltisb 2,-9
vaddubm 2,2,2

Good so far. Now try to shift long by 110 bits:

vui128_t
test_slqi_char_110_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_add (tmp, tmp);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_sll (result, tmp);
}

A reasonable person would expect the compiler to generate:

vspltisb 0,-9
vaddubm 0,0,0
vslo 2,2,0
vsl 2,2,0

And as recently as GCC 7 it did exactly that. But starting with GCC 8 we get
the tricky word add for positive values and performance sucking .rodata load
for any negative value.

For example GCC 13 generates:

addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
lvx 0,0,9
vslo 2,2,0
vsl 2,2,0

where the .rodata const .LC0 is:

.align 4
.LC0:
.long   -286331154
.long   -286331154
.long   -286331154
.long   -286331154

This is correct but unnecessary. It is odd that the .rodata is defined as 4
words (.long) and not 16 bytes (.byte) given that the original const is vector
char.

So this is a bug and a regression starting with GCC 8

[Bug target/117818] vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-27 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #1 from Steven Munroe  ---
May be related to 117007

[Bug target/117818] vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-28 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #3 from Steven Munroe  ---
Tried replacing generic vec_add with specific vec_addubm
(__builtin_vec_add/__builtin_vec_vaddubm).

No joy; the compiler still generates vadduwm and the load from .rodata.

[Bug target/117818] vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #5 from Steven Munroe  ---
I expected compiling for -mcpu=power9 to do a better job generating splats for
small constants.

Given the new instructions like VSX Vector Splat Immediate Byte (xxspltib) and
Vector Extend Sign Byte To Word/Doubleword, the compiler should have an easier
time generating vec_splats(). It would seem that Vector Splat Immediate Byte
would be the perfect way to generate a constant for shift quadword left/right.

But that is not what I am seeing. First note there is no direct intrinsic for
xxspltib. It is sometimes generated for vec_splat_u8(0-15) and
vec_splats((vector unsigned char) x). But sometimes it gets weird.

For example:

vui128_t
test_slqi_char_18_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)18);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

Which I would expect to generate:

xxspltib 34,18
vslo 2,2,0
vsl 2,2,0

But generates:

vspltisb 0,9
vadduwm 0,0,0
vslo 2,2,0
vsl 2,2,0

It recognizes that it can't generate 18 with vspltisb and uses the 18 = 9 * 2
pattern. It also erroneously generates vector add word. Seems like GCC is
reusing the old pattern and ignoring the new instructions.

This is weird because:

vui8_t
test_splat6_char_18 ()
{
  vui8_t tmp = vec_splat_u8(9);
  return vec_add (tmp, tmp);
}

Generates:

xxspltib 34,9
vaddubm 2,2,2

But:

vui8_t
test_splat6_char_31 ()
{
  // 31 = (16+15) = (15 - (-16))
  vui8_t v16 = vec_splat_u8(-16);
  vui8_t tmp = vec_splat_u8(15);
  return vec_sub (tmp, v16);
}

Generates:

xxspltib 34,31

Which seems like a miracle. Is this constant propagation?

But:

vui8_t
test_slqi_char_31_V0 (vui8_t vra)
{
  vui8_t result;
  // 31 = (16+15) = (15 - (-16))
  vui8_t v16 = vec_splat_u8(-16);
  vui8_t tmp = vec_splat_u8(15);
  tmp = vec_sub (tmp, v16);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

Generates:

addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
lxv 32,0(9)
vslo 2,2,0
vsl 2,2,0

Ok, I think I can fix this with:

vui8_t
test_slqi_char_31_V3 (vui8_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)31);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

But no, it still generates:

addis 9,2,.LC0@toc@ha
addi 9,9,.LC0@toc@l
lxv 32,0(9)
vslo 2,2,0
vsl 2,2,0

Which is all very confusing.

[Bug target/117818] vec_add incorrectly generates vadduwm for vector char const inputs.

2024-11-30 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #4 from Steven Munroe  ---
Created attachment 59756
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59756&action=edit
Updated Test case Vector shift long with const shift count -mcpu=power9

This is an extension of the original with examples using vec_splats() that
generate weird code for power9.

[Bug target/117718] Inefficient address computation for d-form vector loads

2024-12-01 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718

--- Comment #6 from Steven Munroe  ---
Another issue with vector loads from .rodata:

Sometimes the compiler will generate this sequence for power8:

addis 9,2,.LC69@toc@ha
addi 9,9,.LC69@toc@l
rldicr 9,9,0,59
lxvd2x 12,0,9
xxpermdi 12,12,12,2

GCC seems to generate this when it wants to load into a VSR (0-31) vs a VR. The
latency is 13 cycles!

The rldicr 9,9,0,59 (clrrdi  r9,r9,4) is not required. The data is already
aligned!

The compiler should know this because this is a vector constant and TOC-relative.
It is not random user data!

The xxpermdi (xxswapd) is needed because lxvd2x is:
 - Endian enabled within the element, but
 - Array order across elements

Unless the data is splatted (DW0 == DW1). The compiler could know this. Likely
the compiler generated this via constant propagation. The compiler should know!

Finally, why is the address calculation a dependent sequence that guarantees the
worst possible latency?

addis 9,2,.LC69@toc@ha
li 0,.LC69@toc@l
lxvd2x 12,9,0
xxpermdi 12,12,12,2

This allows the addis/addi to execute in parallel and enables instruction
fusion. This sequence is 9 cycles (7 cycles without the xxswapd).

See Section 10.1.12, Instruction Fusion, of the POWER8 Processor User’s
Manual.

The addi/lxvd2x pair can be treated as a (Power8 tuned) prefix instruction
which is effectively a D-form lxvd2x. This fusion form applies to {lxvd2x,
lxvw4x, lxvdsx, lvebx, lvehx, lvewx, lvx, lxsdx} instructions.

Yes this clobbers another register (R0 for 2 instructions) but the faster
sequence can actually reduce register pressure.

[Bug target/117818] [12/13/14/15 regression] vec_add incorrectly generates vadduwm for vector char const inputs.

2025-02-05 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117818

--- Comment #7 from Steven Munroe  ---
(In reply to Richard Biener from comment #6)
> is that powerpc64le or powerpc{,64} big endian?  (or both)

Definitely powerpc64le, because few distros support powerpc targets.

I think the last GCC I have that supported powerpc was GCC 8.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-11-22 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

Steven Munroe  changed:

   What|Removed |Added

  Attachment #59291|0   |1
is obsolete||

--- Comment #14 from Steven Munroe  ---
Created attachment 59674
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=59674&action=edit
Updated 32-bit examples showing different behaviour across shift/rotates

Replaced vec-shift-const.c version with additional examples.

[Bug target/117007] Poor optimization for small vector constants needed for vector shift/rotate/mask generation

2024-11-22 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117007

--- Comment #15 from Steven Munroe  ---
Found where handling of vec_splat_u32 constant shift counts are handled
differently across the various shift/rotate intrinsics.

Even for the 5-bit shift counts (the easy case) the behavior of the various
shift/rotate intrinsics is inconsistent. The compiler pays way too much
attention to how the shift count is generated, but differently between shift
left/right word and differently again for rotate left word.

Any reasonable person would assume that using vec_splat_u32() for any shift
value 1 to 31 (-16 to 15) will generate efficient code. And it does for
vec_vslw() which generates two instructions (vspltisw v0,-16; vslw v2,v2,v0).

But the compiler behaves differently for vec_vsrw() and vec_vsraw():
 - for values 1-15 generates:
   - vspltisw v0,15; vsrw    v2,v2,v0
 - for even values between 16 - 30
   - vspltisw v0,8; vadduwm v0,v0,v0; vsrw    v2,v2,v0
 - for odd values between 17 - 31 generates a load for .rodata

And positively strange for vec_vrlw():
 - for values 1-15 it generates:
   - vspltisw v0,15; vrlw v2,v2,v0
 - but for any value between 16 - 31 it gets strange:
1200 :
1200:   30 00 20 39 li  r9,48
1204:   8c 03 00 10 vspltisw v0,0
1208:   67 01 29 7c mtvrd   v1,r9
120c:   93 0a 21 f0 xxspltw vs33,vs33,1
1210:   80 0c 00 10 vsubuwm v0,v0,v1
1214:   84 00 42 10 vrlwv2,v2,v0
1218:   20 00 80 4e blr
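
For reference, a minimal sketch (reconstructed, not the attached test case;
the function names and shift counts are assumptions) of source that hits the
cases above:

#include <altivec.h>
typedef __vector unsigned int vui32_t;

vui32_t
test_vsrw_17 (vui32_t vra)
{ /* odd count > 16: reported above to generate a .rodata load */
  vui32_t shft = vec_splat_u32(-15);  /* low 5 bits of -15 are 17 */
  return vec_vsrw (vra, shft);
}

vui32_t
test_vrlw_16 (vui32_t vra)
{ /* rotate count in 16-31: reported above to generate the mtvrd sequence */
  vui32_t shft = vec_splat_u32(-16);  /* low 5 bits of -16 are 16 */
  return vec_vrlw (vra, shft);
}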

[Bug target/117718] Inefficient address computation for d-form vector loads

2024-11-22 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117718

--- Comment #5 from Steven Munroe  ---
(In reply to Michael Meissner from comment #3)
> No, the issue is with DQ addressing (i.e. vector load/store with offset), we
> can't guarantee that the external address will be properly aligned with the
> bottom 4 bits must be set to 0.

The specific case I am seeing is loading const vectors from .rodata. These are
always quadword aligned. The compiler should know this, as the address is .TOC
(R2) relative.

That has to be the case for -mcpu=power8, otherwise it could not use lvx.

So it seems reasonable to assume that this is also true for P9/P10.
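
A minimal sketch (not from the report; the names and values are assumptions)
of the quadword-aligned, TOC-relative .rodata access in question:

#include <altivec.h>
typedef __vector unsigned int vui32_t;

/* Const vectors in .rodata are 16-byte aligned and addressed TOC-relative,
   so the low 4 bits of the effective address are known to be zero.  */
static const vui32_t pow10v[4] =
  { { 1, 1, 1, 1 }, { 10, 10, 10, 10 },
    { 100, 100, 100, 100 }, { 1000, 1000, 1000, 1000 } };

vui32_t
get_pow10_vec (void)
{
  return pow10v[2];  /* constant index -> constant offset from the TOC base */
}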

[Bug target/118480] Power9 target generates poor code for vector char splat immediate.

2025-01-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480

--- Comment #2 from Steven Munroe  ---
Created attachment 60156
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=60156&action=edit
Examples for vector quadword shift by const immediate for POWER9

Compile with gcc -O3 -Wall -S -mcpu=power9 -mtune=power9 -m64

[Bug target/118480] New: Power9 target generates poor code for vector char splat immediate.

2025-01-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480

Bug ID: 118480
   Summary: Power9 target generates poor code for vector char
splat immediate.
   Product: gcc
   Version: 13.2.1
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

POWER9 (PowerISA 3.0C) adds the VSX Vector Splat Immediate Byte (xxspltib)
instruction that is perfect for generating small integer constants for vector
char values. GCC (sometimes) generates xxspltib, but other times will
inexplicably generate a 1-2 instruction original AltiVec (PowerISA 2.03)
sequence OR place a vector const in .rodata and generate code to load the vector.

For example, generate a vector char of 15's and use that as a shift count for
a shift left quadword by 15 bits.


vui8_t
test_splat7_char_15_V1 ()
{
  return vec_splats((unsigned char)15);
}

test_splat7_char_15_V1:
xxspltib 34,15
blr

vui128_t
test_slqi_char_15_V1 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)15);
  result = vec_slo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_15_V1:
vspltisb 0,15
vslo 2,2,0
vsl 2,2,0
blr

Note that a standalone vec_splats((unsigned char)15) generates:

xxspltib 34,15

But passing the splatted 15 vector to vec_slo/vec_sll (shift left long
(quadword) by 15) generates:

vspltisb 0,15
vslo 2,2,0
vsl 2,2,0

Why/how the xxspltib was converted to vspltisb is not clear. For this specific
value (15) this is Ok. The vspltisb can handle 15 (5-bit SIM) as well as
xxspltib (8-bit IMM8).

But it is a bit strange.

Now let's look at some cases where the required (unsigned) constant does not fit
a 5-bit SIM field but fits nicely in the POWER9 xxspltib 8-bit immediate field.
For example:

vui8_t
test_splat7_char_18 ()
{
  return vec_splats((unsigned char)18);
}

test_splat7_char_18:
xxspltib 34,9
vaddubm 2,2,2
blr

The compiler generates the xxspltib but does not believe that the 18 fits into
the immediate field. That is true for vspltisb but not for xxspltib. Now use
this constant in a shift left quadword, for example:

vui128_t
test_slqi_char_18_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)18);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_18_V3:
.cfi_startproc
vspltisb 0,9
vadduwm 0,0,0
vslo 2,2,0
vsl 2,2,0
blr

Again we see the conversion from 18 to (9 * 2). Not incorrect but not optimal.
For P9 the dependent sequence xxspltib/vslo/vsl would be 9 cycles latency. The
sequence above is 12 cycles.

Now we will look at some larger shift counts, for example 116.

Note: A quadword shift requires a 7-bit shift-count (bits 121:124 for vslo/vsro
and bits 125:127 for vsl/vsr). The 3-bit shift count for vsl/vsr must be
splatted across all 16 bytes. So it is simpler to generate a 7-bit shift count
splatted across the bytes and use that for both.

For example:

vui8_t
test_splat1_char_116_V2 ()
{
  return vec_splats ((unsigned char)116);
}

test_splat1_char_116_V2:
xxspltib 34,116
blr

Good, the compiler generated a single xxspltib. Excellent! And:

vui8_t
test_slqi_char_116_V3 (vui8_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)116);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V3:
addis 9,2,.LC15@toc@ha
addi 9,9,.LC15@toc@l
lxv 32,0(9)
vslo 2,2,0
vsl 2,2,0
blr

What happened here? It could (should) have been the xxspltib/vslo/vsl sequence,
but the compiler went out of its way to generate a vector constant in .rodata
and load it from storage. This is (9+6=15) cycles minimum (L1 cache hit) as
generated.

We would do better using the POWER8 code sequence. For example:

vui8_t
test_slqi_char_116_V0 (vui8_t vra)
{
  vui8_t result;
   // 116-128 = -12
  vui8_t tmp = vec_splat_u8(-12);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_116_V0:
vspltisb 0,-12
vslo 2,2,0
vsl 2,2,0
blr

This works because the lower 7 bits of -12 (0b11110100) are 0b1110100 == 116
(the processor ignores the high-order bit!). This is (3+3+3=9) cycles minimum
as generated for POWER9.

This trick works for 0-15 and 112-127 (-16 to -1) but gets more complicated for
the range 16-111 which requires 2-5 instructions to generate 7-bit shift counts
for POWER8.

For POWER9 it is always better to generate an xxspltib for vector (unsigned)
char splats (vec_splat_u8() / vec_splats()) and quadword shift counts.

[Bug target/118480] Power9 target generates poor code for vector char splat immediate.

2025-01-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118480

--- Comment #1 from Steven Munroe  ---

Strangely, the tricks that seem to work for positive immediate values (see
test_slqi_char_18_V3 above) fail (generate an .rodata load) for negative
values. For example, the shift count for 110 (110-128 = -18):


vui8_t
test_splat1_char_110_V2 ()
{
  return vec_splats ((unsigned char)110);
}

test_splat1_char_110_V2:
xxspltib 34,110
blr

But it fails when the vec_splats result is passed to vec_slo/vec_sll:

vui128_t
test_slqi_char_110_V3 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splats((unsigned char)110);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_110_V3:
addis 9,2,.LC9@toc@ha
addi 9,9,.LC9@toc@l
lxv 32,0(9)
vslo 2,2,0
vsl 2,2,0
blr

Strangely, GCC plays along with the even (but negative) numbers trick. For
example:

vui8_t
test_splat7_char_110_V0 ()
{ // 110-128 = -18
  // (-18 / 2) + (-18 / 2)
  // (-9) + (-9)
  vui8_t tmp = vec_splat_u8(-9);
  return vec_add (tmp, tmp);
}

test_splat7_char_110_V0:
xxspltib 34,247
vaddubm 2,2,2
blr

But it fails when this value is passed to vec_slo/vec_sll:

vui128_t
test_slqi_char_110_V2 (vui128_t vra)
{
  vui8_t result;
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_vaddubm (tmp, tmp);
  result = vec_vslo ((vui8_t) vra, tmp);
  return (vui128_t) vec_vsl (result, tmp);
}

test_slqi_char_110_V2:
addis 9,2,.LC11@toc@ha
addi 9,9,.LC11@toc@l
lxv 32,0(9)
vslo 2,2,0
vsl 2,2,0
blr

Stranger yet, replacing the vaddubm with a shift left by 1:

vui8_t
test_splat7_char__110_V4 ()
{ // 110 - 128 = -18 
  // -18 = (-9 * 2) = (-9 << 1)
  vui8_t v1 = vec_splat_u8(1);
  vui8_t tmp = vec_splat_u8(-9);
  return vec_sl (tmp, v1);
}

test_splat7_char__110_V4:
.LFB34:
.cfi_startproc
xxspltib 34,247
vaddubm 2,2,2
blr

When this is passed to vec_slo/vec_sll, GCC avoids the conversion to .rodata,
but converts the shift back to xxspltib/vaddubm. This is slightly better but
generates an extra (and unnecessary) instruction:

vui8_t
test_slqi_char_110_V4 (vui8_t vra)
{
  vui8_t result;
  // 110 - 128 = -18 = (-9 * 2) = (-9 << 1)
  vui8_t v1 = vec_splat_u8(1);
  vui8_t tmp = vec_splat_u8(-9);
  tmp = vec_sl (tmp, v1);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_110_V4:
.LFB41:
.cfi_startproc
xxspltib 32,247
vaddubm 0,0,0
vslo 2,2,0
vsl 2,2,0
blr

Perhaps we are on to something!
- Avoid negative values
- Use explicit shift instead of add

So, one last example, generating the 7-bit shift count as an octet count
(times 8) plus a bit shift, using only positive values:

vui8_t
test_splat7_char_110_V1 ()
{
  // 110 = (13 * 8) + 6
  vui8_t v3 = vec_splat_u8(3);
  vui8_t tmp = vec_splat_u8(13);
  vui8_t tmp2 = vec_splat_u8(6);
  tmp = vec_sl (tmp, v3);
  return vec_add (tmp, tmp2);
}

test_splat7_char_110_V1:
xxspltib 34,110
blr

And:

vui8_t
test_slqi_char_110_V5 (vui8_t vra)
{
  vui8_t result;
  // 110 = (13 * 8) + 6
  vui8_t v3 = vec_splat_u8(3);
  vui8_t tmp = vec_splat_u8(13);
  vui8_t tmp2 = vec_splat_u8(6);
  tmp = vec_sl (tmp, v3);
  tmp = vec_add (tmp, tmp2);
  result = vec_slo (vra, tmp);
  return vec_sll (result, tmp);
}

test_slqi_char_110_V5:
xxspltib 32,110
vslo 2,2,0
vsl 2,2,0
blr

Finally we have a reasonable result that should have been possible with a
simple vec_splats((unsigned char)110)!

Note: this looks like a possible workaround for generating splatted vectors
from positive constants. It still looks like the problem with negative
constants persists.

[Bug target/119760] GCC does not implement intrinsics for Vector Multiply-by-10 Unsigned Quadword and variants

2025-04-14 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119760

--- Comment #2 from Steven Munroe  ---
(In reply to Richard Biener from comment #1)
> Likely because GCC doesn't know anything about BCD (no BCD "modes", builtins
> or optabs or direct internal functions).

As I stated, the existing bcdadd/sub builtins/intrinsics use the vector
__int128 type, which works fine. So GCC does not need a new type.

https://gcc.gnu.org/onlinedocs/gcc-14.2.0/gcc/PowerPC-AltiVec-Built-in-Functions-Available-on-ISA-2_002e07.html

GCC is missing builtins/intrinsics for useful instructions supported in
POWER9/10.

I was looking at Vector Multiply-by-10 Unsigned Quadword carry/extend for
optimizing decimal to binary (__int128) conversion and for methods to
generate small integer constants for vector __int128.

And I was reminded that the promised builtins were not delivered.
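
In the meantime, a minimal inline-asm sketch (an assumption, not an existing
GCC builtin; the wrapper name and the vui128_t typedef are invented here) of
wrapping Vector Multiply-by-10 Unsigned Quadword:

#include <altivec.h>
typedef __vector unsigned __int128 vui128_t;

static inline vui128_t
vec_mul10uq (vui128_t vra)
{ /* vmul10uq requires -mcpu=power9 (PowerISA 3.0) */
  vui128_t result;
  __asm__ ("vmul10uq %0,%1"
           : "=v" (result)
           : "v" (vra));
  return result;
}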

[Bug target/119760] New: GCC does not implement intrinsics for Vector Multiply-by-10 Unsigned Quadword and variants

2025-04-12 Thread munroesj at gcc dot gnu.org via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119760

Bug ID: 119760
   Summary: GCC does not implement intrinsics for Vector
Multiply-by-10 Unsigned Quadword and variants
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: minor
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: munroesj at gcc dot gnu.org
  Target Milestone: ---

POWER8 (PowerISA 2.07) introduced Binary Coded Decimal (BCD) Add/Subtract.

GCC implemented builtins (__builtin_bcdadd/sub) operating on the vector
__int128 type for these instructions. This included predicates for comparison; see:

6.62.26.3 PowerPC AltiVec Built-in Functions Available on ISA 2.07

POWER9 (PowerISA 3.0) added more BCD instructions (shift/truncate/convert/zoned).
All operate on VSRs (128-bit). Happy COBOL and RPG!

POWER9 also implemented Vector Multiply-by-10 Unsigned Quadword [with
carry/extend], also operating on VSRs.

As far as I can tell, none of the POWER9 BCD operations were implemented in
GCC.