Uros,

> > Actually, in many cases, SSE did help x86 performance as well. That
> > happens in FP-intensive applications which spend a lot of time in
> > loops when the XMM register set can be used more efficiently than
> > the x87 stack.
>
> This code could be a perfect example how XMM register file beats x87
> reg stack. However, contrary to all expectations, x87 code is 20%
> faster(!!) /on p4, but it would be interesting to see this comparison
> on x86_64, or perhaps on 32bit AMD/.
>
> The code structure, produced with -mfpmath=sse, is the same as the
> code structure produced with -mfpmath=x87, so IMO there is no register
> allocator effects in play.

I'll look into it and share what I see.

> I was trying to look into this problem, but on first sight, code
> seems optimal to me...

FWIW, here's some old data I got almost 2 years ago (run-times and
geometric means of the ratios using SPEC's bases):

CPU2000           A       B
164.gzip        205s    203s
175.vpr         185s    188s
176.gcc         117s    116s
181.mcf         313s    314s
186.crafty      112s    112s
197.parser      268s    268s
252.eon         147s    167s
253.perlbmk     175s    180s
254.gap         148s    148s
255.vortex      178s    178s
256.bzip2       211s    202s
300.twolf       313s    328s
Int Geomean     812     801

177.mesa        173s    187s
179.art         346s    690s
183.equake      163s    162s
188.ammp        325s    336s
FP Geomean      757     620

These runs used GCC 3.3.3 from the 3_3-hammer branch. The options for
the runs in column B were "-m32 -O3 -march=k8 -ffast-math
-fomit-frame-pointer -malign-double +FDO"; for column A, the same ones
plus "-mfpmath=sse". The system was a 1.4GHz Athlon 64 with PC2100 RAM.

Because things were so much better with SSE, I haven't run with x87
lately...

--
Evandro