------- Comment #23 from whaley at cs dot utsa dot edu  2006-06-27 14:20 -------
Uros,

OK, I made the stupid assumption that the P4 would behave like the P4e;
I should've known better :)

I got access to a Pentium 4 (family=15, model=2), and indeed I can reproduce
the several surprising things you report:

   (1) SSE does as well as x87 on this platform
   (2) The difference between gcc 3 & 4 x87 performance is extremely minor
   (3) The code is amazingly optimal (roughly 95-96% of peak!)

The significance of (3) is that it tells us we are not in the bad case where
the kernel in question gets such crappy performance that all codes look alike.
This performance was so good that I ran a tester to verify that we were still
getting the right answer, and indeed we are :)

On this platform I didn't install the compilers myself (the system had gcc
versions Red Hat 4.0.2-8 and 3.3.6 installed), so I scoped the assembly, and
indeed they have the fmul difference that causes problems on the other x87
machines. So it is really true that the Pentium 4 handles either instruction
stream almost equally well (not sure the 2% is significant; 2% is less than
clock resolution, though in my timings anytime there is a difference, gcc 4
always loses).

Here is the machine breakdown as measured now:
   LIKES GCC 4    DOESN'T CARE    LIKES GCC 3
   ===========    ============    ===========
   CoreDuo        Pentium 4       PentiumPRO
                                  Pentium III
                                  Pentium 4e
                                  Pentium D
                                  Athlon-64 X2
                                  Opteron

The only machine we are missing that I can think of is the K7 (i.e. original
Athlon, not Athlon-64).  I don't presently have access to a K7, but I can
probably find someone on the developer list who could run the test if you like.

The other thing that would be of interest is for each machine to chart the %
performance lost/gained.  Here, though, we want two numbers: % lost on simple
benchmark code (which is easy to repeat), and % lost with ATLAS code generator
(which compares each compiler's best case out of thousands to each other).  I
will undertake to get this first (quick to run) number for the machines so we
have some quantitative results to look at . . .  The ATLAS comparison is
probably more important, but takes so long that maybe I'll post it only for the
most problematic platforms (i.e., if the arch shows a big drop gcc3 v. gcc4,
see if the drop is that big when we ask ATLAS to auto-adapt to gcc4).
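
For concreteness, here is a minimal sketch of how the per-machine number could
be computed and binned. It is only an illustration: the function names, the
use of MFLOPS as the unit, and the 2% noise threshold (taken from the clock
resolution remark above) are my assumptions, and no measured values are
included.

```python
def percent_change(gcc3_mflops: float, gcc4_mflops: float) -> float:
    """Percent gained (+) or lost (-) when moving from gcc 3 to gcc 4.

    Both arguments are throughput figures (assumed here to be MFLOPS),
    so a negative result means gcc 4 is slower than gcc 3.
    """
    if gcc3_mflops <= 0:
        raise ValueError("gcc 3 baseline must be positive")
    return 100.0 * (gcc4_mflops - gcc3_mflops) / gcc3_mflops


def classify(pct: float, noise: float = 2.0) -> str:
    """Bin a percent change into one of the three columns of the table.

    `noise` is a hypothetical threshold (in percent) below which the
    difference is treated as not significant.
    """
    if pct > noise:
        return "LIKES GCC 4"
    if pct < -noise:
        return "LIKES GCC 3"
    return "DOESN'T CARE"
```

The same two functions would be applied twice per machine: once to the simple
benchmark timings and once to the best-case ATLAS timings for each compiler.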

Thanks,
Clint


-- 


http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827
