------- Comment #44 from whaley at cs dot utsa dot edu  2006-08-07 21:56 -------
Guys,

OK, the mystery of why my hand-patched gcc didn't work is now cleared up.  My
first clue was that neither did the SVN-built gcc!  It turns out your peephole
opt is only applied if I throw the flag -O3 rather than -O, which is what my
tarfile used.  Any reason it's done only at the high optimization levels, since
it makes such a performance difference?

FYI, in gcc3 -O gets better performance than -O3, which is why that's my
default flag.  However, it appears that gcc4 gets very nice performance with
-O3.  It's fairly common for -O to give better performance than -O3, though
(since the ATLAS code is already aggressively optimized, gcc's maximum
optimization levels often de-optimize already-optimal code), so turning this
peephole on at the default level, or being able to turn it on and off manually,
would be ideal . . .
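
To make the "already aggressively optimized" point concrete, here is a rough
sketch of the kind of hand-unrolled, multi-accumulator inner loop I mean (my
paraphrase, not actual ATLAS source; the function name and unroll factor are
made up).  The register blocking and scheduling of code like this have already
been chosen by ATLAS's search, so further high-level transformations can undo
that work as easily as improve it:

   /* Hypothetical sketch of an ATLAS-style register-blocked dot product:
    * hand-unrolled by 4, one accumulator per product so the adds don't
    * serialize.  Assumes n is a multiple of 4; real kernels are larger
    * and are generated/tuned automatically. */
   static double dot_unroll4(const double *a, const double *b, int n)
   {
      double c0 = 0.0, c1 = 0.0, c2 = 0.0, c3 = 0.0;
      int i;
      for (i = 0; i < n; i += 4)
      {
         c0 += a[i+0] * b[i+0];
         c1 += a[i+1] * b[i+1];
         c2 += a[i+2] * b[i+2];
         c3 += a[i+3] * b[i+3];
      }
      return ((c0 + c1) + (c2 + c3));
   }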

>That's why you should compare 4.2 before and after my patch, instead.

Yeah, except 4.2 w/o your patch has horrible performance.  Our goal is not to
beat horrible performance, but rather to get good performance!  Gcc 3 provides
a measure of good performance.  However, I take your point that it'd be nice to
see the new stuff put a headlock on the crap performance, so I include that
below as well :)

Here's some initial data.  I report MFLOPS achieved by the kernel as compiled
by: gcc3 (usually gcc 3.2 or 3.4.3), gccS (current SVN gcc), and gcc4 (usually
gcc 4.1.1).  I will try to get more data later, but this is pretty suggestive,
IMHO.

                              DOUBLE            SINGLE
              PEAK        gcc3/gccS/gcc4    gcc3/gccS/gcc4
              ====        ==============    ==============
Pentium-D :   2800        2359/2417/2067    2685/2684/2362
Ath64-X2  :   5600        3677/3585/2102    3680/3914/2207
Opteron   :   3200        2590/2517/1507    2625/2800/1580

So, it appears to me we are seeing the same pattern I previously saw in my
hand-tuned SSE code: Intel likes the new pattern of doing the last load as part
of the FMUL instruction, but AMD is hampered by it.  Note that gccS is the best
compiler for both single & double on the Intel. On both AMD machines, however,
it wins only for single, where the cost of the load is lower.  It loses to gcc3
for double, where load performance more completely determines matmul
performance.  This is consistent with the view that gcc 4 does some other
optimizations better than gcc 3, and so if we got the fldl removed, gcc 4 would
win for all precisions . . .
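
For anyone reading along, here is roughly what that difference looks like at
the instruction level for one multiply-accumulate of the kernel.  This is my
own paraphrase of the two idioms, not actual output from either compiler; the
address registers are hypothetical, and it assumes the accumulator starts at
the top of the x87 stack:

   /* Sketch only: the two x87 renderings of one multiply-accumulate. */
   static void madd1(double *c0, const double *a, const double *b, int i)
   {
      *c0 += a[i] * b[i];
      /* gcc3-style (separate loads, register-register multiply):
       *    fldl   (%eax)          # load a[i]
       *    fldl   (%ecx)          # load b[i]
       *    fmulp  %st, %st(1)     # st(1) *= st(0), pop
       *    faddp  %st, %st(1)     # add the product into the accumulator
       *
       * gccS-style (the last load folded into the multiply):
       *    fldl   (%eax)          # load a[i]
       *    fmull  (%ecx)          # st(0) *= b[i], memory operand on FMUL
       *    faddp  %st, %st(1)     # add the product into the accumulator
       */
   }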

Don't get me wrong, your patch has already removed the emergency: in the worst
case so far you are less than 3% slower.  However, I suspect that if we added
an optional (AMD-only) peephole step to get rid of all possible fmul[s,l], then
we'd win for double, and win even more for single, on AMD chips . . .  So, any
chance of an AMD-only or flag-controlled peephole step to get rid of the last
fmul[s,l]?

>Or you can disable the fmul[sl] instructions altogether.

As I mentioned, my own hand-tuning has indicated that the final fmul[sl] is
good for Intel NetBurst archs, but bad for AMD Hammer archs.

I'll see about posting some vectorization data ASAP.  Can someone create a new
bug report so that the two threads of inquiry don't get mixed up, or do you
want to just intermix them here?

Thanks,
Clint

P.S.: I tried to run this on the Core by hand-translating gccS-generated
assembly to OS X assembly.  The double-precision gccS code runs at the same
speed as Apple's gcc.  However, the single precision is an order of magnitude
slower, as I experienced this morning on the P4E.  This is almost certainly an
error in my makefile, but damned if I can find it.


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
