https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704
Bug ID: 95704
Summary: PPC: int128 shifts should be implemented branchless
Product: gcc
Version: 8.3.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: target
Assignee: unassigned at gcc dot gnu.org
Reporter: jens.seifert at de dot ibm.com
Target Milestone: ---

Created attachment 48741
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts

PowerPC processors don't like branches, and branch mispredictions lead to large overhead.

Left and right shifts of unsigned __int128 can be implemented in 8 instructions, which can be processed on 2 pipelines almost in parallel, giving ~5 cycles of latency on Power 7 and 8. Shift right algebraic of __int128 can be implemented in 10 instructions. Overall, the latency is comparable to that of the branching code.

The attached file contains the branchless implementations in C. I know that they rely on undefined behavior, but the resulting assembly is the interesting part.

The rldicl 8,5,0,32 at the beginning of each routine is also unnecessary.
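
For illustration only, here is a rough sketch of how such branchless shifts can be composed from 64-bit halves. It is not the code from the attachment. It assumes the PowerPC convention that sld/srd take a 7-bit shift count and return 0 for counts 64..127, and that srad returns 64 copies of the sign bit in that range. The type u128 and the helpers sld7, srd7 and srad7 are hypothetical names that emulate those instructions in portable C without undefined behavior, so on other targets each helper costs more than one instruction.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical 128-bit value as two 64-bit halves (not from the attachment). */
typedef struct { uint64_t hi, lo; } u128;

/* Emulates PowerPC sld: 7-bit count, result is 0 for counts 64..127. */
static inline uint64_t sld7(uint64_t x, unsigned n)
{
    uint64_t keep = (uint64_t)((n >> 6) & 1) - 1; /* all ones iff count < 64 */
    return (x << (n & 63)) & keep;
}

/* Emulates PowerPC srd: same count rule as sld7, shifting right. */
static inline uint64_t srd7(uint64_t x, unsigned n)
{
    uint64_t keep = (uint64_t)((n >> 6) & 1) - 1; /* all ones iff count < 64 */
    return (x >> (n & 63)) & keep;
}

/* Emulates PowerPC srad: counts 64..127 yield 64 copies of the sign bit. */
static inline uint64_t srad7(uint64_t x, unsigned n)
{
    uint64_t sign = (uint64_t)0 - (x >> 63);                /* all ones iff negative */
    uint64_t big = (uint64_t)0 - (uint64_t)((n >> 6) & 1);  /* all ones iff count >= 64 */
    unsigned m = n & 63;
    uint64_t small = (x >> m) | ((sign << 1) << (63 - m));  /* sign-filled shift by m */
    return (small & ~big) | (sign & big);
}

/* Shift left by n (0..127); at most one cross term is nonzero for any n. */
static inline u128 shl128(u128 v, unsigned n)
{
    assert(n < 128);
    u128 r;
    r.hi = sld7(v.hi, n) | srd7(v.lo, 64 - n) | sld7(v.lo, n - 64);
    r.lo = sld7(v.lo, n);
    return r;
}

/* Logical shift right by n (0..127). */
static inline u128 lshr128(u128 v, unsigned n)
{
    assert(n < 128);
    u128 r;
    r.lo = srd7(v.lo, n) | sld7(v.hi, 64 - n) | srd7(v.hi, n - 64);
    r.hi = srd7(v.hi, n);
    return r;
}

/* Arithmetic shift right by n (0..127). */
static inline u128 ashr128(u128 v, unsigned n)
{
    assert(n < 128);
    uint64_t big = (uint64_t)0 - (uint64_t)(n >> 6); /* all ones iff n >= 64 */
    u128 r;
    /* srad7(hi, n - 64) would sign-fill for n < 64, so mask it out there. */
    r.lo = srd7(v.lo, n) | sld7(v.hi, 64 - n) | (srad7(v.hi, n - 64) & big);
    r.hi = srad7(v.hi, n);
    return r;
}
```

The unsigned wrap-around in 64 - n and n - 64 is intentional: only the low 7 bits of the count are used, which mirrors how the hardware treats the count register. The extra masking in ashr128 hints at why the algebraic variant needs more instructions than the unsigned ones.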