Hi,

On the POWER5, gcc 4.2 gets roughly half the performance of gcc 3.3.3 on the best ATLAS DGEMM kernel. By throwing the flags -fno-schedule-insns -fno-rerun-loop-opt I'm able to get most of that performance back. The most important flag is -fno-schedule-insns, so I suspect the scheduler was rewritten between these releases.

I will append a tarfile that will build a simplified kernel so you can see the effects yourself. This kernel is simplified, so it doesn't have quite the performance of the best one, but the general trend is the same (the best kernel is way too complicated to use).

One thing that you might scope out is a feature we have found on the PowerPC970FX (the direct descendant of the POWER5): I went from 69% of peak to 85% by scheduling like instructions in sets of 4 (i.e. do 4 loads, then 4 fmacs, etc., even when this hurts advancing loads). Instruction alignment is also important on this architecture, despite it being putatively RISC. I think both these features are results of its complicated front-end, which does something similar to RISC-to-VLIW translation on the fly. I suspect the sets-of-4 rule helps in tracking the groups, but I don't know for sure. This scheduling seems to hurt the POWER4 only slightly.

I have been trying to install gcc 4.2 on the PowerPC970FX, but so far no luck (it doesn't seem to like MacOSX). I will let you know if I get results for the PowerPC970FX.

Let me know if there is something else you need.

Cheers,
Clint

--
Summary: disastrous scheduling for POWER5
Product: gcc
Version: 4.2.0
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: c
AssignedTo: unassigned at gcc dot gnu dot org
ReportedBy: whaley at cs dot utsa dot edu

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=32523
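
For illustration only, here is a rough C sketch of what the sets-of-4 ordering mentioned in the message looks like at the source level. This is not the ATLAS kernel and not the code in the tarfile: the 2x2 register block, the data layout, and the names kern_interleaved and kern_grouped are all hypothetical. C source order is only a hint to the compiler, so the generated instruction order has to be checked in the assembly; building with -fno-schedule-insns, as in the message, should make it less likely that gcc rearranges it.

```c
#include <stddef.h>

/*
 * Hypothetical 2x2 register block: C(2x2) += A(2xK) * B(Kx2).
 * Assumed layout (illustration only): A[2*k + i], B[2*k + j], C[2*i + j].
 */

/* Interleaved source order: each multiply-add follows its loads directly. */
void kern_interleaved(size_t K, const double *A, const double *B, double *C)
{
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];
    for (size_t k = 0; k < K; k++) {
        const double a0 = A[2 * k];
        const double b0 = B[2 * k];
        c00 += a0 * b0;
        const double b1 = B[2 * k + 1];
        c01 += a0 * b1;
        const double a1 = A[2 * k + 1];
        c10 += a1 * b0;
        c11 += a1 * b1;
    }
    C[0] = c00; C[1] = c01; C[2] = c10; C[3] = c11;
}

/* Grouped source order: four loads first, then four multiply-adds. */
void kern_grouped(size_t K, const double *A, const double *B, double *C)
{
    double c00 = C[0], c01 = C[1], c10 = C[2], c11 = C[3];
    for (size_t k = 0; k < K; k++) {
        /* set of 4 loads */
        const double a0 = A[2 * k];
        const double a1 = A[2 * k + 1];
        const double b0 = B[2 * k];
        const double b1 = B[2 * k + 1];
        /* set of 4 multiply-adds */
        c00 += a0 * b0;
        c01 += a0 * b1;
        c10 += a1 * b0;
        c11 += a1 * b1;
    }
    C[0] = c00; C[1] = c01; C[2] = c10; C[3] = c11;
}
```

Both functions perform the same arithmetic in the same order per accumulator, so they should return identical results; only the order in which loads and multiply-adds appear differs. Whether the grouped form actually runs faster depends on the instruction stream the compiler emits and on the machine, which is the point of the report.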