https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102062

            Bug ID: 102062
           Summary: powerpc suboptimal unrolling simple array sum
           Product: gcc
           Version: 11.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
          Assignee: unassigned at gcc dot gnu.org
          Reporter: npiggin at gmail dot com
  Target Milestone: ---
            Target: powerpc64le-linux-gnu

--- test.c ---
int test(int *arr, int sz)
{
        int ret = 0;
        int i;

        if (sz < 1)
                __builtin_unreachable();

        for (i = 0; i < sz*2; i++)
                ret += arr[i];

        return ret;
}
---

gcc-11 compiles this to:
test:
        rldic 4,4,1,32
        addi 10,3,-4
        rldicl 9,4,63,33
        li 3,0
        mtctr 9
.L2:
        addi 8,10,4
        lwz 9,4(10)
        addi 10,10,8
        lwz 8,4(8)
        add 9,9,3
        add 9,9,8
        extsw 3,9
        bdnz .L2
        blr

I may be unaware of a constraint in the C standard here, but maintaining two
base addresses seems pointless, as does starting the first one at offset -4.

The bigger problem is keeping a single sum. Keeping two sums and adding them at
the end reduces the critical-path latency of the loop from 6 cycles to 2, which
brings throughput on large loops from 6 cycles per iteration down to about 2.2
on POWER9 without harming short loops:

test:
        rldic 4,4,1,32
        rldicl 9,4,63,33
        mtctr 9
        li 8,0
        li 9,0
.L2:
        lwz 6,0(3)
        lwz 7,4(3)
        addi 3,3,8
        add  8,8,6
        add  9,9,7
        bdnz .L2
        add 9,9,8
        extsw 3,9
        blr
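
For illustration, the same idea can be written at the source level. This is
only a minimal sketch, not what GCC emits or should emit. The name
test_two_sums is hypothetical, and the unsigned accumulators are an assumption
of the sketch: they keep intermediate wraparound well defined, while the final
conversion back to int relies on GCC's implementation-defined modulo behaviour.

```c
/* Hypothetical source-level equivalent of the two-accumulator loop.
 * Sums arr[0 .. 2*sz-1], keeping even and odd elements in separate
 * running sums so the two add chains are independent of each other. */
int test_two_sums(const int *arr, int sz)
{
        unsigned int sum0 = 0;  /* running sum of arr[0], arr[2], ... */
        unsigned int sum1 = 0;  /* running sum of arr[1], arr[3], ... */
        int i;

        for (i = 0; i < sz; i++) {
                sum0 += (unsigned int)arr[2 * i];
                sum1 += (unsigned int)arr[2 * i + 1];
        }

        /* Combine once after the loop; unsigned addition wraps, so the
         * low 32 bits match a single running sum. */
        return (int)(sum0 + sum1);
}
```

Each accumulator then has only one add per iteration on its dependency chain,
and the combining add and sign extension happen once, outside the loop.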

Any reason this can't be done?
