------- Comment #23 from whaley at cs dot utsa dot edu 2006-06-27 14:20 -------

Uros,
OK, I made the stupid assumption that the P4 would behave like the P4e; should've known better :) I got access to a Pentium 4 (family=15, model=2), and indeed I can repeat the several surprising things you report:
   (1) SSE does as well as x87 on this platform
   (2) The difference between gcc 3 & 4 x87 performance is extremely minor
   (3) The code is amazingly optimal (roughly 95-96% of peak!)

The significance of (3) is that it tells us we are not in the bad case where the kernel in question gets such crappy performance that all codes look alike. This performance was so good that I ran a tester to verify that we were still getting the right answer, and indeed we are :)

On this platform, I didn't install the compilers myself (system had Red Hat 4.0.2-8 and 3.3.6 installed), so I scoped the assembly, and indeed they have the fmul difference that causes problems on the other x87 machines. So it is really true that the Pentium 4 handles either instruction stream almost equally well (not sure the 2% is significant; 2% is less than clock resolution, though in my timings, anytime there is a difference, gcc 4 always loses).

Here is the machine breakdown as measured now:

   LIKES GCC 4     DOESN'T CARE    LIKES GCC 3
   ===========     ============    ===========
   CoreDuo         Pentium 4       PentiumPRO
                                   Pentium III
                                   Pentium 4e
                                   Pentium D
                                   Athlon-64 X2
                                   Opteron

The only machine we are missing that I can think of is the K7 (i.e. the original Athlon, not the Athlon-64). I don't presently have access to a K7, but I can probably find someone on the developer list who could run the test if you like.

The other thing that would be of interest is to chart, for each machine, the % performance lost/gained. Here, though, we want two numbers: % lost on the simple benchmark code (which is easy to repeat), and % lost with the ATLAS code generator (which compares each compiler's best case out of thousands to each other). I will undertake to get this first (quick to run) number for the machines so we have some quantitative results to look at . . .
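
As a minimal sketch of the two percentages discussed above, assuming "% lost/gained" means the relative change in MFLOPS going from gcc 3 to gcc 4 on the same kernel, and "% of peak" means achieved MFLOPS over the machine's theoretical peak. The function names pct_change and pct_of_peak are hypothetical and appear in neither the benchmark nor ATLAS:

```c
/* Hypothetical helpers for illustration only; not taken from the
 * benchmark or from ATLAS. */

/* Relative change in performance going from gcc 3 to gcc 4, in percent.
 * Negative means gcc 4 is slower (performance lost), positive means it
 * is faster (performance gained). Inputs are MFLOPS for the same kernel. */
double pct_change(double mflops_gcc3, double mflops_gcc4)
{
    return 100.0 * (mflops_gcc4 - mflops_gcc3) / mflops_gcc3;
}

/* Achieved performance as a percentage of theoretical peak. */
double pct_of_peak(double mflops, double peak_mflops)
{
    return 100.0 * mflops / peak_mflops;
}
```

With made-up inputs (not measurements), pct_change(100.0, 90.0) is -10, i.e. a 10% loss under gcc 4, and pct_of_peak(1900.0, 2000.0) is 95.
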
The ATLAS comparison is probably more important, but takes so long that maybe I'll post it only for the most problematic platforms (i.e., if the arch shows a big drop gcc3 v. gcc4, see if the drop is that big when we ask ATLAS to auto-adapt to gcc4).

Thanks,
Clint

--
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=27827