https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95704

            Bug ID: 95704
           Summary: PPC: int128 shifts should be implemented branchless
           Product: gcc
           Version: 8.3.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: target
          Assignee: unassigned at gcc dot gnu.org
          Reporter: jens.seifert at de dot ibm.com
  Target Milestone: ---

Created attachment 48741
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=48741&action=edit
input with branchless 128-bit shifts

PowerPC processors don't like branches, and branch mispredictions lead to large
overhead.

Shift left/right of unsigned __int128 can be implemented in 8 instructions, which
can be processed on 2 pipelines almost in parallel, leading to ~5 cycle latency
on Power 7 and 8.
Shift right algebraic of __int128 can be implemented in 10 instructions.
Overall, the latency is comparable to that of the branching code.

In the attached file you will find the branchless implementations in C. I know
that this uses undefined behavior, but the resulting assembly is the
interesting part.
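
For readers without the attachment, here is a minimal sketch of the branchless
idea in C with defined behavior. It is not the attached code. The type
u128_parts and the helpers sld64, srd64, shl128 and shr128 are hypothetical
names chosen for illustration. The helpers emulate the PowerPC sld/srd
semantics (only the low 7 bits of the shift count are used, and counts of 64
to 127 give 0) with a mask instead of relying on an out-of-range C shift.
Whether a compiler reduces this to the instruction counts quoted above is not
claimed here.

```c
#include <stdint.h>

/* Hypothetical two-word representation of a 128-bit unsigned value. */
typedef struct { uint64_t hi, lo; } u128_parts;

/* Emulates PowerPC sld: low 7 bits of n are used; result is 0 if they are >= 64. */
static inline uint64_t sld64(uint64_t x, unsigned n)
{
    /* keep is all ones when bit 6 of n is clear, zero when it is set. */
    uint64_t keep = (uint64_t)((n >> 6) & 1u) - 1u;
    return (x << (n & 63u)) & keep;
}

/* Emulates PowerPC srd with the same shift-count rule. */
static inline uint64_t srd64(uint64_t x, unsigned n)
{
    uint64_t keep = (uint64_t)((n >> 6) & 1u) - 1u;
    return (x >> (n & 63u)) & keep;
}

/* Branchless 128-bit logical shift left; n is taken modulo 128. */
static inline u128_parts shl128(u128_parts v, unsigned n)
{
    u128_parts r;
    r.lo = sld64(v.lo, n);
    /* For n < 64 the third term is 0; for n > 64 the first two are 0. */
    r.hi = sld64(v.hi, n) | srd64(v.lo, 64u - n) | sld64(v.lo, n - 64u);
    return r;
}

/* Branchless 128-bit logical shift right; n is taken modulo 128. */
static inline u128_parts shr128(u128_parts v, unsigned n)
{
    u128_parts r;
    r.hi = srd64(v.hi, n);
    r.lo = srd64(v.lo, n) | sld64(v.hi, 64u - n) | srd64(v.hi, n - 64u);
    return r;
}
```

The unsigned wrap-around of 64u - n and n - 64u is intentional: after the
helpers mask the count to 7 bits, exactly the terms that should not
contribute fall into the 64..127 range and become 0.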

The rldicl 8,5,0,32 instructions at the beginning of the routines are also
unnecessary.
