Hi,

On the POWER5, gcc 4.2 gets roughly half the performance of gcc 3.3.3 on the
best ATLAS DGEMM kernel.  By adding the flags
   -fno-schedule-insns -fno-rerun-loop-opt
I'm able to get most of that performance back.  The more important of the two
is -fno-schedule-insns, so I suspect the scheduler was rewritten between these
releases.

I will attach a tarfile that builds a simplified kernel so you can see the
effects yourself.  Because it is simplified, it doesn't have quite the
performance of the best one, but the general trend is the same (the best kernel
is way too complicated to use).

One thing that you might scope out is a feature we have found on the
PowerPC970FX (the direct descendant of the POWER5): I went from 69% of peak to
85% by scheduling like instructions in sets of 4 (i.e. do 4 loads, then 4
fmacs, etc., even when this hurts advancing loads).  Instruction alignment is
also important on this architecture, despite it being putatively RISC.  I
think both these features are results of its complicated front-end, which does
something similar to RISC-to-VLIW translation on the fly.  I suspect the
sets-of-4 rule helps in tracking the groups, but I don't know for sure . . .
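
To make the sets-of-4 idea concrete, here is a rough C sketch of what I mean
by the ordering.  This is not the ATLAS kernel: dot4_grouped is a made-up name,
and a dot product stands in for the real DGEMM inner loop purely for
illustration.  It also assumes the length is a multiple of 4.  Note that source
order is only a hint; the compiler's scheduler is free to reorder it, so seeing
this grouping in the generated code may require turning the scheduler off or
writing the loop in assembly.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical illustration, not the ATLAS kernel: a dot product unrolled
 * by 4 and written as 4 loads, 4 more loads, then 4 multiply-adds, rather
 * than interleaving each load with the multiply-add that uses it. */
double dot4_grouped(const double *a, const double *b, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i;

    assert(n % 4 == 0);   /* assumption: no cleanup loop in this sketch */

    for (i = 0; i < n; i += 4) {
        /* set 1: four loads from a */
        const double a0 = a[i];
        const double a1 = a[i + 1];
        const double a2 = a[i + 2];
        const double a3 = a[i + 3];

        /* set 2: four loads from b */
        const double b0 = b[i];
        const double b1 = b[i + 1];
        const double b2 = b[i + 2];
        const double b3 = b[i + 3];

        /* set 3: four independent multiply-adds, one per accumulator */
        s0 += a0 * b0;
        s1 += a1 * b1;
        s2 += a2 * b2;
        s3 += a3 * b3;
    }

    /* combine the four partial sums */
    return (s0 + s1) + (s2 + s3);
}
```

The four separate accumulators keep the multiply-adds independent of each
other, so each set of 4 really is 4 like instructions with no dependence
between them.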

This scheduling seems to hurt the POWER4 only slightly.  I have been trying to
install gcc 4.2 on PowerPC970FX, but so far no luck (it doesn't seem to like
MacOSX).  I will let you know if I get results for the PowerPC970FX.

Let me know if there is something else you need.

Cheers,
Clint


-- 
           Summary: disastrous scheduling for POWER5
           Product: gcc
           Version: 4.2.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: whaley at cs dot utsa dot edu


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32523
