[Bug target/94393] New: Powerpc suboptimal 64-bit constant comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94393

            Bug ID: 94393
           Summary: Powerpc suboptimal 64-bit constant comparison
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

--- test case ---
void test1(unsigned long a)
{
        if (a > 0xc000000000000000ULL)
                printf("yes\n");
}

void test2(unsigned long a)
{
        if (a >= 0xc000000000000000ULL)
                printf("yes\n");
}
---

The first (the important part) compiles to

        li 9,-1
        rldicr 9,9,0,1
        cmpld 0,3,9
        blelr 0

The second to

        lis 9,0xbfff
        ori 9,9,0xffff
        sldi 9,9,32
        oris 9,9,0xffff
        ori 9,9,0xffff
        cmpld 0,3,9
        blelr 0

The second could use the same 2-insn constant as the first, but with bltlr.
[Bug target/94395] New: Powerpc suboptimal 64-bit constant generation near large values with few bits set
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94395

            Bug ID: 94395
           Summary: Powerpc suboptimal 64-bit constant generation near large
                    values with few bits set
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

0xc000000000000000UL is generated with

        li 9,-1
        rldicr 9,9,0,1

0xbfffffffffffffffUL (0xc000000000000000UL - 1) is

        lis 9,0xbfff
        ori 9,9,0xffff
        sldi 9,9,32
        oris 9,9,0xffff
        ori 9,9,0xffff

Could be

        li 9,-1
        rldicr 9,9,0,1
        subi 9,9,1
[Bug target/94393] Powerpc suboptimal 64-bit constant comparison
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94393

--- Comment #4 from Nicholas Piggin ---
(In reply to Alan Modra from comment #3)
> There are two parts to fixing this PR.
> 1) We can do better in the sequences generated for some constants.
> 2) rs6000_rtx_costs needs to be accurate, so that expensive constants are
> put in memory and stay there with optimisation.
>
> Having looked at part 2 a little, I'd say fixes for that are definitely not
> stage 4 material.

Well, the other part is that a different branch test could be chosen in order to use the cheapest constant. Even if you generate the 0xbfff... constant more cheaply, it would still be more expensive than 0xc000... That was my intention in opening the two bugs, but I didn't explain myself very well.
[Bug target/96017] New: Powerpc suboptimal register spill in likely path
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017

            Bug ID: 96017
           Summary: Powerpc suboptimal register spill in likely path
           Product: gcc
           Version: 9.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---
            Target: powerpc64le-linux-gnu
             Build: gcc version 9.2.1 20190909 (Debian 9.2.1-8)

-- test.c --
extern int foo;
extern void slowpath(int *);

int test(int *val)
{
        int ret = foo;

        if (__builtin_expect(*val != 0, 0))
                slowpath(val);

        return ret;
}
--

Compiling with -O2 gives the following asm. It seems possible for the fast path to avoid the stack frame by using a volatile register to save the val argument in case the slow path needs it (or alternatively to save the load from 'foo', as r31 is used now, but that requires an extra register move on a critical path for the return value). This should be smaller and faster code even for the slow path too.

        addis r2,r12,0
        addi r2,r2,0
        lwz r9,0(r3)
        addis r10,r2,0
        ld r10,0(r10)
        std r31,-8(r1)
        stdu r1,-48(r1)
        lwa r31,0(r10)
        cmpwi r9,0
        bne 1f
        addi r1,r1,48
        mr r3,r31
        ld r31,-8(r1)
        blr
        nop
        ori r2,r2,0
1:
        mflr r0
        std r0,64(r1)
        bl slowpath
        nop
        ld r0,64(r1)
        addi r1,r1,48
        mr r3,r31
        ld r31,-8(r1)
        mtlr r0
        blr
[Bug target/96017] Powerpc suboptimal register spill in likely path
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96017 --- Comment #3 from Nicholas Piggin --- This is possibly a more targeted and simpler test case for at least one of the problems in bug 84443. I would like to re-test that case after this one is solved if it's not an obvious duplicate.
[Bug rtl-optimization/96475] New: direct threaded interpreter with computed gotos generates suboptimal dispatch loop
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96475

            Bug ID: 96475
           Summary: direct threaded interpreter with computed gotos generates
                    suboptimal dispatch loop
           Product: gcc
           Version: 10.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: rtl-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
                CC: segher at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

Created attachment 48999
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48999&action=edit
test case

With -O2, code generation for run_program_goto in the attached test case produces a central indirect-branch dispatcher, with each handler branching back to it. Direct-threaded code with an indirect branch at the end of each handler is faster on POWER9 when there are no branch mispredictions, because it executes fewer branches, and it should generally also do better with branch prediction, because each handler gets its own indirect branch.
[Bug target/88765] New: powerpc64le-linux-gnu sub-optimal code generation for builtin atomic ops
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88765

            Bug ID: 88765
           Summary: powerpc64le-linux-gnu sub-optimal code generation for
                    builtin atomic ops
           Product: gcc
           Version: 8.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

gcc version 8.2.0 (Debian 8.2.0-4)

Linux uses a lot of non-trivial atomic operations, and implementing them with compare_exchange results in sub-optimal code generation. A common one is add_unless_zero, which is commonly used with RCU to take a reference count, or fail if the last reference has already gone (which is a very rare case).

---
#include <stdbool.h>

bool add_unless_zero(unsigned long *mem, unsigned long inc)
{
        unsigned long val;

        do {
                val = __atomic_load_n(mem, __ATOMIC_RELAXED);
                if (__builtin_expect(val == 0, false))
                        return false;
        } while (__builtin_expect(!__atomic_compare_exchange_n(mem, &val,
                        val + inc, true, __ATOMIC_RELAXED, __ATOMIC_RELAXED),
                false));

        return true;
}
---

This compiles to:

add_unless_zero:
.L4:
        ld 10,0(3)
        cmpdi 7,10,0
        add 8,10,4
        beq 7,.L5
        ldarx 9,0,3
        cmpd 0,9,10
        bne 0,.L3
        stdcx. 8,0,3
.L3:
        bne 0,.L4
        li 3,1
        blr
.L5:
        li 3,0
        blr

Better would be

add_unless_zero:
.L4:
        ldarx 9,0,3
        cmpdi 0,9,0
        add 9,9,4
        beq 0,.L5
        stdcx. 9,0,3
        bne 0,.L4
        li 3,1
        blr
.L5:
        li 3,0
        blr

Using extended inline asm to implement these is an adequate workaround. Unfortunately that does not work for 128-bit operations on powerpc, because there is no register-pair constraint, and code generation with the builtins is much worse. Changing the types to __int128_t gives a bad result:

add_unless_zero:
        lq 10,0(3)
        mr 6,3
        or. 9,10,11
        addc 3,11,4
        mr 7,11
        adde 11,10,5
        beq 0,.L16
        std 28,-32(1)
        std 29,-24(1)
        std 30,-16(1)
        std 31,-8(1)
.L12:
        lqarx 28,0,6
        xor 9,29,7
        xor 10,28,10
        or. 9,9,10
        bne 0,.L4
        mr 30,11
        mr 31,3
        stqcx. 30,0,6
.L4:
        mfcr 3,128
        rlwinm 3,3,3,1
        bne 0,.L17
.L2:
        ld 28,-32(1)
        ld 29,-24(1)
        ld 30,-16(1)
        ld 31,-8(1)
        blr
.L17:
        lq 10,0(6)
        or. 9,10,11
        addc 3,11,4
        mr 7,11
        adde 11,10,5
        bne 0,.L12
        li 3,0
        b .L2
.L16:
        li 3,0
        blr

Closer to ideal would be

add_unless_zero:
.Lagain:
        lqarx 6,0,3
        or. 8,6,7
        addc 6,6,4
        adde 7,7,5
        beq 0,.Lfail
        stqcx. 6,0,3
        bne 0,.Lagain
        li 3,1
        blr
.Lfail:
        li 3,0
        blr
[Bug tree-optimization/84443] New: powerpc suboptimal code generation (shrink wrap unlikely path) for Linux spinlocks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84443

            Bug ID: 84443
           Summary: powerpc suboptimal code generation (shrink wrap unlikely
                    path) for Linux spinlocks
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43452
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43452&action=edit
testcase with comment at the top describing desired output

A small fast-path function gets several non-volatile registers saved on the stack, and a sub-optimal restore in the return path. The example is derived from (but not identical to) the Linux kernel spinlock implementation.

Tested with gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian 8-20180207-2).
[Bug c/84483] New: powerpc sub-optimal code generation - conditional relative branch to and over blr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84483

            Bug ID: 84483
           Summary: powerpc sub-optimal code generation - conditional relative
                    branch to and over blr
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43472
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43472&action=edit
test case with description in comment at the top

I have attached a test case, derived loosely from Linux spinlock code, that has two possible types of sub-optimal code generation, with relative branches used to execute or avoid a blr. A conditional blr might be used instead.
[Bug c/84514] New: powerpc sub optimal condition register reuse with extended inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84514

            Bug ID: 84514
           Summary: powerpc sub-optimal condition register reuse with extended
                    inline asm
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

Created attachment 43488
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=43488&action=edit
test case with description in comment at the top

There seem to be some missed opportunities for reusing the condition register across extended asm (that does not clobber cc).
[Bug middle-end/84443] powerpc suboptimal code generation (shrink wrap unlikely path) for Linux spinlocks
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84443

--- Comment #3 from Nicholas Piggin ---
(In reply to Segher Boessenkool from comment #2)
> If you want some specific machine code for something complex like this, it
> is much easier to write the machine code than to fight the compiler.

Yes, understood. This may be the way we go eventually. I thought the test case could be interesting for reference. Often we juggle the C/inline asm code around to get a good pattern, which can be a bit easier than writing it completely in asm. It does not have to be absolutely optimal, of course.

Thanks for taking a look and providing some explanation of the problem. Any improvement would be great.

> That said:
>
> 1) "flags" is stored in the same register everywhere. This is a problem in
> expand: it puts the return value in the same pseudo in the whole function.
> This is a bad idea because (if it did not) after optimisation the return
> value is just a hard register (r3), and putting all _other_ uses in the same
> register is a pessimisation; like here (and in many other cases, there are
> other PRs for this!) it causes that pseudo to be put in a callee-save
> register (r30).
>
> 2) That should be fixed in expand (to enable other optimisations with the
> return pseudo), but IRA should also be smarter about live-range splitting
> for shrink-wrapping.
>
> 3) Separate shrink-wrapping should deal with CRFs. I have a prototype;
> it has a few small problems (we need to handle things a little differently
> for pretty much every ABI, sigh), and it does not help very much until the
> other problems are solved (GCC 9 work).
>
> 4) Shrink-wrapping itself could also do more live-range splitting than it
> currently does.
[Bug target/84626] New: powerpc toc register is reloaded unnecessarily
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84626

            Bug ID: 84626
           Summary: powerpc toc register is reloaded unnecessarily
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian 8-20180207-2):

Test case:

void test(void (*fn)(void), unsigned long i)
{
        while (i--)
                fn();
}

Generates code:

  2c:   18 00 41 f8     std r2,24(r1)
  30:   00 00 00 60     nop
  34:   00 00 00 60     nop
  38:   00 00 00 60     nop
  3c:   00 00 42 60     ori r2,r2,0
  40:   78 f3 cc 7f     mr r12,r30
  44:   a6 03 c9 7f     mtctr r30
  48:   ff ff ff 3b     addi r31,r31,-1
  4c:   21 04 80 4e     bctrl
  50:   18 00 41 e8     ld r2,24(r1)
  54:   ff ff bf 2f     cmpdi cr7,r31,-1
  58:   e8 ff 9e 40     bne cr7,40

The r2 load could be moved out of the loop.
[Bug target/84627] New: powerpc excess padding generated for POWER9 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627

            Bug ID: 84627
           Summary: powerpc excess padding generated for POWER9 target
           Product: gcc
           Version: 8.0.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

gcc version 8.0.1 20180207 (experimental) [trunk revision 257435] (Debian 8-20180207-2):

Test case:

void test(void (*fn)(void), unsigned long i)
{
        while (i--)
                fn();
}

Generates code with -O2 -mcpu=power9:

  2c:   18 00 41 f8     std r2,24(r1)
  30:   00 00 00 60     nop
  34:   00 00 00 60     nop
  38:   00 00 00 60     nop
  3c:   00 00 42 60     ori r2,r2,0
  40:   78 f3 cc 7f     mr r12,r30
  44:   a6 03 c9 7f     mtctr r30
  48:   ff ff ff 3b     addi r31,r31,-1
  4c:   21 04 80 4e     bctrl
  50:   18 00 41 e8     ld r2,24(r1)
  54:   ff ff bf 2f     cmpdi cr7,r31,-1
  58:   e8 ff 9e 40     bne cr7,40

On power9, I wonder if the nops and ori should be avoided? Aligning to a quad word boundary is reasonable for fetch, but is there any advantage for dispatch to add the extra padding?
[Bug target/84627] powerpc excess padding generated for POWER9 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627 --- Comment #2 from Nicholas Piggin --- Ah sorry, target is powerpc64le
[Bug target/84627] powerpc excess padding generated for POWER9 target
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84627

Nicholas Piggin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|INVALID                     |FIXED

--- Comment #4 from Nicholas Piggin ---
After some more discussion and testing, it was determined that for POWER9, 16 bytes of padding is sufficient for these small loops, and the alignment was adjusted.

https://gcc.gnu.org/viewcvs/gcc?view=revision&revision=258260
Author: segher
Date: Mon Mar 5 19:11:54 2018 UTC (3 hours, 27 minutes ago)

Log Message:
rs6000: Don't align tiny loops to 32 bytes for POWER9

For POWER4..POWER8 we align loops of 5..8 instructions to 32 bytes
(instead of to 16 bytes) because that executes faster. This is no longer
the case on POWER9, so we can just as well only align to 16 bytes.

        * config/rs6000/rs6000.c (rs6000_loop_align): Don't align tiny loops
        to 32 bytes when compiling for POWER9.
[Bug middle-end/71509] Bitfield causes load hit store with larger store than load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71509

Nicholas Piggin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |npiggin at gmail dot com

--- Comment #5 from Nicholas Piggin ---
This test case seems like it may be related. It does the right thing and uses all 4-byte ops when the 8-byte alignment is removed. I post it here because it may not always be the case that the smallest op is best.

struct s {
        unsigned long align1;
        union {
                unsigned int blah;
                unsigned int a:1;
        };
};

void test2(struct s *s)
{
        s->blah = 100;
        if (s->a)
                asm volatile("#blah");
}

Generates (gcc 7.0.1):

test2:
        li 9,100
        stw 9,8(3)
        ld 9,8(3)
        andi. 9,9,0x1
        beqlr 0
#APP
 # 29 "a.c" 1
        #blah
 # 0 "" 2
#NO_APP
        blr

There is a more general case of mismatched load and store sizes in unions with different-size types, but in this case the sizes could be matched, I think.
[Bug tree-optimization/77647] New: Missed opportunity to use register value causes additional load
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=77647

            Bug ID: 77647
           Summary: Missed opportunity to use register value causes
                    additional load
           Product: gcc
           Version: 6.2.0
            Status: UNCONFIRMED
          Severity: enhancement
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

static inline long load(long *p)
{
        long ret;

        asm ("movq %1,%0\n\t" : "=r" (ret) : "m" (*p));
        if (ret != *p)
                __builtin_unreachable();
        return ret;
}

long foo(long *mem)
{
        long ret;

        ret = load(mem);
        return ret - *mem;
}

foo() compiles down to 'xorl %eax,%eax ; ret', which is great. Changing the minus to plus gives

        movq (%rdi),%rax
        addq (%rdi),%rax
        ret
[Bug target/102485] -Wa,-many no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485

Nicholas Piggin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |npiggin at gmail dot com

--- Comment #6 from Nicholas Piggin ---
(In reply to Peter Bergner from comment #1)
> I agree it is GCC's job to emit a ".machine CPU" directive that allows the
> assembler to correctly assemble the code GCC generates.

GCC already passes -m to the assembler, though. The justification given for emitting the .machine directive is fixing a build breakage due to a build system that passes an incorrect -m to the assembler. *That* is the broken code (if any) that should have been fixed. Instead it is hacked around in a way that breaks working code that passes down the -Wa,-many option as specified.

> Your test case
> however uses inline asm and GCC does not know that you are using the mfppr32
> mnemonic. The assembler code you write in an inline asm is basically a
> black box to the compiler. It is therefore up to the programmer to ensure
> that either the -mcpu=CPU GCC option that is being used (which emits a
> ".machine CPU" directive) is enough to assemble the mnemonics in the
> inline asm, or you have to emit them in your inline asm.

The kernel builds with a base compatibility (say -mcpu=power4) and then has certain code paths, taken dynamically when running on newer CPUs, that use newer instructions via inline asm. This usage is important and pervasive; it seems unreasonable to break it. Adding .machine directives throughout inline asm for every instruction not in the base compatibility class is pretty horrible.
[Bug target/102485] -Wa,-many no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485

--- Comment #8 from Nicholas Piggin ---
(In reply to Segher Boessenkool from comment #7)
> > GCC already passes -m to the assembler though.
>
> That mostly is historic.

So? I was pointing out that the compiler already tells the assembler what instruction set to use, without the .machine directive.

> > The justification for emitting the .machine directive is given as fixing a
> > build breakage due to a build system that passes an incorrect -m to the
> > assembler.
>
> Not really, no.

That really is the justification for emitting the .machine directive, as provided in the changelog of the commit which introduced the change.

> That is just one tiny part of the problem. It is impossible
> to know what instruction sets we need ahead of time, and -many cannot work
> (and *does not* work: there are quite a few mnemonics that encode to
> different insns on different architecture versions (or for different CPUs),
> and we cannot know which is wanted, or which is preferred, ahead of time).

I understand the problems with -many, but it can and does work for real software, e.g., the Linux kernel as of before this change. It's not -many I'm wedded to, though; it's any ability to fix this sanely, which the .machine directive takes away. The kernel would change to using a specific CPU, e.g., -mcpu=power4 -Wa,-mpower10, and deal with the very rare clashing mnemonics (e.g., tlbie) specially with the .machine directive when an older encoding is required.

> > *That* is the broken code (if any) that should have been fixed. But instead
> > that is hacked around in a way that breaks working code that passes down
> > -Wa,-many option as specified.
>
> There is no working code that uses -many (except by accident, if no problem
> has hit you yet).

I'll reword: "Instead it is hacked around in a way that breaks working code that passes down the -Wa,-m option as specified."

> > The kernel builds with a base compatibility (say -mcpu=power4) and then has
> > certain code paths that are dynamically taken if running on newer CPUs which
> > use newer instructions with inline asm.
> >
> > This is an important usage and it's pervasive, it seems unreasonable to
> > break it. Adding .machine directives throughout inline asm for every
> > instruction not in the base compatibility class is pretty horrible.
>
> It is the only correct thing to do.

It's not. Not emitting .machine and passing -mcpu is just as correct, with the added bonus that it allows software to use a superset of instructions in such cases, and the further bonus that existing "broken" code that uses -many will continue to work.

The correct way to deal with this is not to break it, but to add a warning to -many in binutils for some period, to give code a chance to migrate. I'm all for removing -many, and that is the right way to do it: by deprecating -many and providing warnings, not by hacking around it in the compiler in a way that breaks things.
[Bug target/102485] -Wa,-many no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102485 --- Comment #9 from Nicholas Piggin --- And upstream gas still doesn't even warn with -many!!
[Bug c/104671] New: -Wa,-m no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

            Bug ID: 104671
           Summary: -Wa,-m no longer has any effect
           Product: gcc
           Version: 10.3.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
                CC: amodra at gcc dot gnu.org, bergner at gcc dot gnu.org,
                    dje at gcc dot gnu.org, meissner at gcc dot gnu.org,
                    pc at us dot ibm.com, segher at gcc dot gnu.org,
                    wschmidt at gcc dot gnu.org
  Target Milestone: ---

Commit e154242724b084380e3221df7c08fcdbd8460674 ("Don't pass -many to the assembler") also added a workaround that emits a ".machine" directive at the top of each generated .s file, corresponding to the -mcpu= option, to work around buggy build code which uses incorrect -Wa,-mcpu options:

    This patch also emits .machine assembler directives for ELF targets
    when functions are compiled for different cpus via attributes or
    pragmas. That's necessary when the initial -m option passed to the
    assembler doesn't enable the superset of all opcodes emitted, as seen
    by a failure of gcc.target/powerpc/clone2.c without .machine when
    building gcc for power8.

This also prevents passing the assembler a -m option that is a strict superset of the -mcpu= generated code. That is a valid situation, where .c code is built with a baseline architecture compatibility plus dynamic support for newer instructions via inline asm. The Linux kernel uses this pattern extensively.
[Bug c/104671] -Wa,-m no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

--- Comment #1 from Nicholas Piggin ---
The comment in recent binutils.git commit cebc89b9328 sheds some more light on this and possibly provides a workaround in binutils for the errant .machine directive.

The referenced gcc bug #101393 looks like it proposed removing .machine, but I'm not entirely sure whether it was also about other things, and it hasn't been resolved.
[Bug target/104671] -Wa,-m no longer has any effect
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104671

--- Comment #4 from Nicholas Piggin ---
(In reply to Alan Modra from comment #3)
> (In reply to Nicholas Piggin from comment #0)
> > Commit e154242724b084380e3221df7c08fcdbd8460674 ("Don't pass -many to the
> > assembler") also added a workaround emitting a ".machine" directive to the
> > top of each generated .s file
>
> Nope, wrong commit. The .machine added there is in response to an
> attribute or pragma selecting a different cpu, ie. it's in response to user
> source code saying "compile this function for this particular cpu, not
> necessarily the one given on the gcc command line".
>
> Commit b9f12a01bedb5 is the first commit that added .machine to the top of
> assembly files. The early .machine was not always emitted by that
> particular commit. Always emitting the early .machine came later.

Thanks for the correction. And thank you for taking the time to write the detailed explanation in bug #102485; it's very helpful.

I'm not sure where that leaves this report wrt gcc, but upstream binutils solves the problem for me, so I would be happy to consider it resolved. It is what it is; we can patch Linux to fix the build issues, and the proposed fixes don't look too onerous. We need to remove -many from our build anyway (we should have done so a long time ago), so this will give the necessary prod.
[Bug c/106895] New: powerpc64 strange extended inline asm behaviour with register pairs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895

            Bug ID: 106895
           Summary: powerpc64 strange extended inline asm behaviour with
                    register pairs
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
                CC: segher at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

This is the Debian 12.2.0-1 build.

powerpc64 instructions operating on 128 bits use an "even/odd pair" of adjacent GPRs, where the lower-numbered register specifies the pair and must be an even number. This code compiled with -O2:

void set128(__int128 val, __int128 *mem)
{
#if WORKAROUND
        asm("");
#endif
        asm("stq %1,%0" : "=m"(*mem) : "r"(val));
}

Results in an invalid even/odd pair:

        stq 3,0(5)
        blr

If compiled with -DWORKAROUND it results in a valid pair:

        mr 10,3
        mr 11,4
        stq 10,0(5)
        blr
[Bug c/102169] New: powerpc64 int memory operations using FP instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102169

            Bug ID: 102169
           Summary: powerpc64 int memory operations using FP instructions
           Product: gcc
           Version: 12.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
                CC: bergner at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

--- test.c ---
int foo, bar;

void test(void)
{
        foo = bar;
}
---

Using Debian gcc 10.2 with the -O2 flag, this compiles to:

        addis r2,r12,0
        addi r2,r2,0
        addis r9,r2,0
        addi r9,r9,0
        lfiwzx f0,0,r9
        addis r9,r2,0
        addi r9,r9,0
        stfiwx f0,0,r9
        blr

Peter confirmed it also uses FP registers with trunk (but I don't have the asm output at hand).

This can be suboptimal on some processors; e.g., on POWER9 lfiwzx is "Tuple Restricted (R)", which reduces dispatch throughput in the cycle it is dispatched. And it generally just seems like a surprising thing to do with no shortage of GPRs.
[Bug c/102239] New: powerpc suboptimal boolean test of contiguous bits
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102239

            Bug ID: 102239
           Summary: powerpc suboptimal boolean test of contiguous bits
           Product: gcc
           Version: 11.2.1
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

gcc version 11.2.1 20210815 (Debian 11.2.0-2), build flags -O2.

--- test.c ---
void foo(long arg)
{
        if (arg & ((1UL << 33) | (1UL << 34)))
                asm volatile("# if");
        else
                asm volatile("# else");
}
---

generates:

foo:
        rldicr 3,3,29,1
        srdi. 3,3,29
        beq 0,.L6
        # if
        blr
.L6:
        # else
        blr

This test of multiple contiguous bits could be done with a single rldicr. or rldicl. instruction. Those instructions do tend to be more expensive than the Rc=0 form (C2 cracked on POWER9, slower pipe class on POWER10), but I think they should be better than the current 2-instruction sequence.
[Bug rtl-optimization/102062] powerpc suboptimal unrolling simple array sum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

Nicholas Piggin changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
         Resolution|---                         |FIXED
             Status|NEW                         |RESOLVED

--- Comment #16 from Nicholas Piggin ---
gcc-11 -O2 -mcpu=power10 now generates the identical code that Bill listed for trunk:

.L2:
        lwz 4,4(9)
        lwz 5,12(9)
        lwz 6,8(9)
        lwzu 8,16(9)
        add 10,4,10
        add 10,10,5
        add 7,6,7
        add 7,7,8
        bdnz .L2

Thanks all.
[Bug c/108018] New: Wide immediate sequences not scheduled for POWER10 fusion
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108018

            Bug ID: 108018
           Summary: Wide immediate sequences not scheduled for POWER10 fusion
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---

POWER10 has "wide immediate" fusion sequences, which include

        addi rx,ra,si ; addis rx,rx,SI
and
        addis rx,ra,si ; addi rx,rx,SI

--- test.c ---
static int foo, bar;

unsigned long test(void)
{
        return (unsigned long)&foo + (unsigned long)&bar;
}
---

This test case, when compiled with -O2 -mcpu=power10 -mno-pcrel, results in

        addis 2,12,.TOC.-.LCF0@ha
        addi 2,2,.TOC.-.LCF0@l
        addis 3,2,.LANCHOR0@toc@ha
        addis 9,2,.LANCHOR0+4@toc@ha
        addi 3,3,.LANCHOR0@toc@l
        addi 9,9,.LANCHOR0+4@toc@l
        add 3,3,9
        blr

The TOC pointer generation is scheduled properly because it is the global entry prologue, but the variable address generation is scheduled to favour dependencies rather than the static fusion sequences that P10 has, which should be preferable.
[Bug target/108239] New: -mprefixed causes too large displacements for extended inline asm memory operands
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108239

            Bug ID: 108239
           Summary: -mprefixed causes too large displacements for extended
                    inline asm memory operands
           Product: gcc
           Version: 12.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
                CC: segher at gcc dot gnu.org
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

--- test.c ---
// powerpc64le-linux-gnu-gcc -O2 -mcpu=power10 -mno-pcrel -c test.c
#include <stdint.h>

static inline uint32_t readl(uint32_t *addr)
{
        uint32_t ret;

        __asm__ __volatile__("lwz %0,%1" : "=r" (ret) : "m" (*addr));
        return ret;
}

int test(void *addr)
{
        return readl(addr + 0x8024);
}
---

This generates the invalid assembly 'lwz 3,32804(3)', with the displacement exceeding the allowable size. Using -mno-prefixed avoids the problem.
[Bug c/102062] New: powerpc suboptimal unrolling simple array sum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

            Bug ID: 102062
           Summary: powerpc suboptimal unrolling simple array sum
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

--- test.c ---
int test(int *arr, int sz)
{
        int ret = 0;
        int i;

        if (sz < 1)
                __builtin_unreachable();

        for (i = 0; i < sz*2; i++)
                ret += arr[i];

        return ret;
}
---

gcc-11 compiles this to:

test:
        rldic 4,4,1,32
        addi 10,3,-4
        rldicl 9,4,63,33
        li 3,0
        mtctr 9
.L2:
        addi 8,10,4
        lwz 9,4(10)
        addi 10,10,8
        lwz 8,4(8)
        add 9,9,3
        add 9,9,8
        extsw 3,9
        bdnz .L2
        blr

I may be unaware of a constraint of the C standard here, but maintaining the two base addresses seems pointless, as does beginning the first at offset -4. The bigger problem is keeping a single sum. Keeping two sums and adding them at the end reduces the critical latency of the loop from 6 to 2, which brings throughput on large loops from 6 cycles per iteration down to about 2.2 on POWER9, without harming short loops:

test:
        rldic 4,4,1,32
        rldicl 9,4,63,33
        mtctr 9
        li 8,0
        li 9,0
.L2:
        lwz 6,0(3)
        lwz 7,4(3)
        addi 3,3,8
        add 8,8,6
        add 9,9,7
        bdnz .L2
        add 9,9,8
        extsw 3,9
        blr

Any reason this can't be done?
[Bug c/102062] powerpc suboptimal unrolling simple array sum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

--- Comment #3 from Nicholas Piggin ---
(In reply to Bill Schmidt from comment #2)
> As expected, I get similar code when compiling either for P9 or P10.

Oh, I should have specified: -O2 is the only option. If I add -fvariable-expansion-in-unroller it has no effect, just to make sure. It's gcc from Debian (gcc version 11.2.0 (Debian 11.2.0-3)). Maybe they've done something to change this.
[Bug c/102062] powerpc suboptimal unrolling simple array sum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062 --- Comment #5 from Nicholas Piggin --- (In reply to Bill Schmidt from comment #1)
> Regarding the latter question, I'm surprised it's not being done. This
> behavior is controlled by -fvariable-expansion-in-unroller, which was
> enabled by default for PowerPC targets a couple of releases back. You
> reported this against GCC 11.2, but I'm skeptical. What options are you
> using?
>
> Compiling with -O2 and current trunk, I see variable expansion kicking in,
> and I also see the same base register in use in all references in the loop:
>
> test:
> .LFB0:
>         .cfi_startproc
>         .localentry test,1
>         slwi 4,4,1
>         li 10,0
>         li 7,0
>         addi 9,3,-4
>         extsw 4,4
>         andi. 6,4,0x3
>         addi 5,4,-1
>         mr 8,4
>         beq 0,.L9
>         cmpdi 0,6,1
>         beq 0,.L13
>         cmpdi 0,6,2
>         bne 0,.L22
> .L14:
>         lwzu 6,4(9)
>         addi 4,4,-1
>         add 10,10,6
> .L13:
>         lwzu 6,4(9)
>         cmpdi 0,4,1
>         add 10,10,6
>         beq 0,.L19
> .L9:
>         srdi 8,8,2
>         mtctr 8
> .L2:
>         lwz 4,4(9)
>         lwz 5,12(9)
>         lwz 6,8(9)
>         lwzu 8,16(9)
>         add 10,4,10
>         add 10,10,5
>         add 7,6,7
>         add 7,7,8
>         bdnz .L2
> .L19:
>         add 3,10,7
>         extsw 3,3
>         blr
>         .p2align 4,,15
> .L22:
>         lwz 10,0(3)
>         mr 9,3
>         mr 4,5
>         b .L14

That asm does well on the test, better than my version (a little bit on P9, a lot on P10). It does have 2x more unrolling, which probably helps a bit.
[Bug rtl-optimization/102062] powerpc suboptimal unrolling simple array sum
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062 --- Comment #14 from Nicholas Piggin --- (In reply to Bill Schmidt from comment #10)
> Well, the problem is that we still generate suboptimal code on GCC 11. I
> don't know whether we want to address that or not.
>
> I suppose we aren't going to backport Haochen's lovely patch for sign
> extensions, and it's probably not worth messing around with variable
> expansion just for GCC 11...
>
> Nick, can you use -ftree-vectorize -fvect-cost-model=cheap to deal with
> this (or even use -O3, which is recommended and probably does the same
> thing)? Or are you holding out for some fix in GCC 11.x?

For my case, no, but it's only a very small issue in the scheme of things; I only found it by looking at code generation. So you can close immediately if it's fixed upstream, or whatever else you think best. Thanks all.
[Bug target/106895] powerpc64 strange extended inline asm behaviour with register pairs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895 Nicholas Piggin changed:

           What      |Removed   |Added
           Resolution|INVALID   |---
           Status    |RESOLVED  |UNCONFIRMED

--- Comment #4 from Nicholas Piggin --- I would like to reopen this. The example provided is not the desired operation; it was just an example. Such register-pair instructions on Power are not usable in extended inline asm without this (except maybe by explicit fixed allocation with clobbers or register variables, which is undesirable).
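The "explicit fixed allocation with register variables" dismissed above as undesirable can be sketched like this (my example; the function name, the choice of r8/r9, and the lq illustration are mine, and the endianness of the two halves is glossed over — the non-powerpc branch is a portable stand-in so the example stays runnable):

```c
#include <stdint.h>

/* Sketch of the register-variable workaround: pin the two halves to
 * r8/r9 so that lq, which requires an even/odd GPR pair, gets one.
 * Which half is most significant depends on endianness; that detail
 * is deliberately glossed over here. The fallback branch keeps the
 * example compilable and runnable on other targets. */
static inline unsigned __int128 load_quad(const unsigned __int128 *p)
{
#if defined(__powerpc64__) && defined(_ARCH_PWR8)
    register uint64_t a __asm__("r8");
    register uint64_t b __asm__("r9");

    __asm__("lq %0,%2" : "=&r" (a), "=&r" (b) : "m" (*p));
    return ((unsigned __int128)a << 64) | b;
#else
    return *p;  /* portable fallback for illustration */
#endif
}
```

The pattern works, but it fixes the register allocator's choice at the source level, which is exactly the cost the comment above objects to; a pair constraint or a PTImode operand would let the allocator choose any even/odd pair.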
[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895 --- Comment #9 from Nicholas Piggin --- (In reply to Segher Boessenkool from comment #8)
> (In reply to Peter Bergner from comment #6)
> > (In reply to Segher Boessenkool from comment #5)
> > > Constraints are completely the wrong tool for this. Just use modes, which
> > > *are* the right tool?
> >
> > Well you cannot specify modes in the asm, so I think you're saying we need
> > to use the correct type that maps to an internal-to-GCC mode that has the
> > even/odd register behavior, so something like:
> >
> > unsigned int foo __attribute__ ((mode (XX)));
> >
> > ...where XXmode is the new integer mode that gives us even/odd register
> > pairs? Of course we have to be careful about how this all works wrt -m32
> > versus -m64.
>
> No, the type there is "unsigned int". I meant to say exactly what I did say:
> just use modes. Which you indeed do in user code by the mode attribute, yes.
>
> And you do not need a new mode: PTImode should just work. But the user
> specifying that is currently broken, it seems?

I don't know why a constraint is wrong and a mode is right, or why TI doesn't work but PTI apparently would, but I'll take anything that works. Could we get PTI implemented? Does it need a new issue opened?

> Without -mpowerpc64 you cannot *have* 128-bit integers in registers. That
> should be fixed, but you cannot have it in just *two* registers, which is
> what is required here. For most targets that then means -m64 is required.
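The mode-attribute spelling under discussion can be tried directly; a small sketch (the typedef and function names are mine). mode(TI) is accepted on 64-bit targets; per the thread, replacing TI with PTI is what currently fails with "no data type for mode 'PTI'":

```c
/* The mode attribute spelling from the discussion above. TImode gives
 * a 128-bit integer held in a register pair (not necessarily an
 * even/odd one on rs6000); swapping TI for PTI below is, per the
 * thread, what currently produces "no data type for mode 'PTI'". */
typedef unsigned int u128_ti __attribute__ ((mode (TI)));

/* A trivial use so the 128-bit type is actually exercised. */
static u128_ti shift_up(u128_ti x)
{
    return x << 64;
}
```

The point of PTImode over TImode on rs6000, as Segher explains later in the thread, is that PTImode operands live only in GPRs and are restricted to even/odd pairs, which is what pair instructions like lq/stq need.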
[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895 --- Comment #11 from Nicholas Piggin --- (In reply to Segher Boessenkool from comment #10)
> (In reply to Nicholas Piggin from comment #9)
> > I don't know why constraint is wrong and mode is right
>
> Simple: you would need O(2**T*N) constraints for our existing N register
> constraints, together with T features like this. But only O(2**T) modes at
> most.

I guess that would be annoying if you couldn't have modifiers on constraints, or a bad algorithm for working them out. Fair enough.

> > or why TI doesn't work but PTI apparently would,
>
> Because this is exactly what PTImode is *for*!

Right, I accept it is. I meant I just would not have been able to work it out (assuming that if PTI were documented, it would be "Partial Tetra Integer" and be no more useful than the other P?I type documentation).

> > but I'll take anything that works. Could we
> > get PTI implemented? Does it need a new issue opened?
>
> It was implemented in 2013. The restriction to only even pairs was a bugfix,
> also from 2013.
>
> If you have code like
>
> typedef __int128 __attribute__((mode(PTI))) even;
>
> you get an error like
>
> error: no data type for mode 'PTI'
>
> This needs fixing. You can keep it in this PR?

Sure, that would be much appreciated.
[Bug target/106895] powerpc64 unable to specify even/odd register pairs in extended inline asm
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106895 --- Comment #13 from Nicholas Piggin --- (In reply to Segher Boessenkool from comment #12)
> > I guess that would be annoying if you couldn't have modifiers on constraints
>
> There is no such thing as "operand modifiers". There are *output* modifiers:
> they change how an operand is *printed*, they do not change the operand in
> any way, shape, or form.

No, I mean like x, y, z for different register pairs, or Pn where n=0..4, eI, eP, eQ for different immediates. If they're practically just implemented as another constraint, what is really the big deal with adding an even/odd GPR pair constraint anyway? You say it's O(2**T*N) additional constraints; as far as I can tell it turns out to be one extra constraint, or maybe 2 if you need one that excludes r0.

> > or a bad algorithm for working them out. Fair enough.
>
> No idea what you mean here?

If you can't deal well with more constraints.

> > > > or why TI doesn't work but PTI apparently would,
> > >
> > > Because this is exactly what PTImode is *for*!
> >
> > Right I accept it is, I meant I just would not have been able to work it out
> > (assuming if PTI was documented it would be "Partial Tetra Integer" and be
> > no more useful than the other P?I type documentation.
>
> For the rs6000 port, multi-register operands are not restricted to aligned
> register numbers ("even/odd pairs"). (Some other ports do have this.) We use
> the existing PTI mode for that (it also can be allocated in GPRs only, never
> in VSRs, unlike TImode).
>
> "Partial" does not have much meaning here. A minority of ports use partial
> integer words for what they were introduced for originally: modes that are
> smaller than a full register, say, a 24-bit mode when registers are 32 bits.
>
> We use it as another integer mode that is the same size. It is unfortunate
> that we still have to resort to such tricks.

Okay, that explains it, thank you.