[Bug target/115973] New: PPCLE: Inefficient code for __builtin_uaddll_overflow and __builtin_addcll

2024-07-17 Thread jens.seifert at de dot ibm.com via Gcc-bugs
: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- unsigned long long add(unsigned long long a, unsigned long long b, unsigned long long *ovf) { return

[Bug target/117568] New: z13: Use vector instructions for fixed length memcmp

2024-11-13 Thread jens.seifert at de dot ibm.com via Gcc-bugs
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- #include #include Up to 16 bytes consider using vector instructions for memcmp. This is not required for 1,2,4,8 bytes, but for the rest. For general

[Bug target/117561] New: z13/z14 Please add a scalar_test_data_class builtin

2024-11-13 Thread jens.seifert at de dot ibm.com via Gcc-bugs
Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- I found no way to efficiently check the fp data class on z using the wftcidb (z13) and wftcisb (z14) instructions. For PowerPC scalar_test_data_class exists and provides

[Bug target/117928] New: z14 builtin for VLBR instruction missing

2024-12-05 Thread jens.seifert at de dot ibm.com via Gcc-bugs
: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- I want to use the z14 vlbr instruction, but I found no builtin for it. The assembler reports an "unknown" mnemonic for vlbr, but I see the instruction in the "

[Bug target/119468] New: PPCLE: Inefficient implementation of __builtin_parityll

2025-03-25 Thread jens.seifert at de dot ibm.com via Gcc-bugs
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- bool parity(unsigned long long l) { return __builtin_parityll(l); } bool parity2(unsigned long long l) { return

[Bug target/119494] New: z196: Inefficient implementation for __builtin_parityll for z196 < z15

2025-03-27 Thread jens.seifert at de dot ibm.com via Gcc-bugs
ity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- bool parityll(unsigned long long x) { return __builtin_parityll(x); } Code generation for z15 and above is opti

[Bug target/119702] New: PPCLE: Inefficient auto-vectorization for 64-bit shifts on Power9

2025-04-09 Thread jens.seifert at de dot ibm.com via Gcc-bugs
Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- void lshift1(unsigned long long *a) { a[0] <<= 1; a[1] <<= 1; } Output: lshift1(unsigned long long*):

[Bug target/119468] PPCLE: Inefficient implementation of __builtin_parityll

2025-04-09 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119468 --- Comment #2 from Jens Seifert --- popcnt + parity is slower than just 64-bit popcount and extracting last bit. "missed-optimization" opportunity applies as well to big endian. Optimal code: popcntd 3, 3 clrldi 3, 3, 63
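
The approach described in this comment (a 64-bit popcount followed by extracting the lowest bit) can be sketched in portable C as follows. This is only an illustration: the function name `parity_via_popcount` is hypothetical and does not come from the bug report, and whether the compiler actually emits `popcntd` plus `clrldi` for it depends on the GCC version and target options.

```c
#include <stdbool.h>

/* Parity of a 64-bit value: count the set bits, then keep only bit 0.
   An odd number of set bits yields true, an even number yields false.
   The name parity_via_popcount is illustrative, not from the report. */
static bool parity_via_popcount(unsigned long long x)
{
    return (__builtin_popcountll(x) & 1) != 0;
}
```

The result should agree with `__builtin_parityll` for every input; the difference discussed in the bug is only in the instruction sequence generated.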

[Bug target/119468] PPCLE: Inefficient implementation of __builtin_parityll

2025-04-09 Thread jens.seifert at de dot ibm.com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119468 --- Comment #4 from Jens Seifert --- clang is emitting extended mnemonics. On gcc, I can only enforce this by using inline assembly: unsigned long long parityfast(unsigned long long in) { __asm__("popcntd %0,%1":"+r"(in)); return in & 1

[Bug target/119912] New: PPC: Inefficient vector immediate shifts

2025-04-23 Thread jens.seifert at de dot ibm.com via Gcc-bugs
: target Assignee: unassigned at gcc dot gnu.org Reporter: jens.seifert at de dot ibm.com Target Milestone: --- Shifts by -1 should be performed by a 0xFF..FF constant as PPC has modulo shift and the constant generation for 0xFF..FF requires just 1 instruction. On Power9 always use
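
The modulo-shift argument in this report can be modeled in scalar C. The sketch below assumes 64-bit elements and assumes that a "shift by -1" means a shift count whose low bits are all ones; the helper name `shl_modulo64` is hypothetical. Because only the low 6 bits of the count are used, an all-ones constant behaves the same as a count of 63.

```c
#include <stdint.h>

/* Model of a left shift whose count is reduced modulo the element
   width (64 here), as the report describes for PPC vector shifts.
   shl_modulo64 is an illustrative name, not a GCC or PPC API. */
static uint64_t shl_modulo64(uint64_t value, uint64_t count)
{
    return value << (count & 63u);
}
```

With this model, `shl_modulo64(v, UINT64_MAX)` and `shl_modulo64(v, 63)` give the same result, which is why the cheaper all-ones constant can stand in for the exact count.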
