chandra.anuradha% gcc -v
Using built-in specs.
Target: i686-pc-linux-gnu
Configured with: ../gcc-4.0.0/configure --prefix=/usr/local/gcc-4.0.0
Thread model: posix
gcc version 4.0.0

Compile line:
  g++ -mmmx -g -O3 test_mmx_diff4.cpp

Background:
The Dirac video codec project uses MMX to speed up the encoding process. With gcc 3.3.x and gcc 3.4.x, MMX gives a performance gain of 20-30%, depending on the platform Dirac is built on. However, there was a huge performance dip when the Dirac project was built using gcc-4.0.0. In fact, on 32-bit systems Dirac performed worse with MMX optimisations enabled than with them turned off.

I've incorporated a scaled-down version of a Dirac class that uses MMX optimisations in the attached test_mmx_diff4.cpp, and compared the performance of gcc-4.0.0 with gcc-3.4.3 / gcc-3.3.3 on different architectures. The results, using the compile line above, are as follows:

1. AMD Dual Opteron Processor, Suse 9.2 (32 bit)

          gcc-3.4.3    gcc-4.0.1 20050503 (prerelease)
   real   1.25         2.87
   user   1.24         2.87
   sys    0.00         0.00

2. Intel Dual Xeon 3.0 GHz, Suse 9.2 64 bit

          gcc-3.4.3    gcc-4.0.0
   real   1.09         1.58
   user   1.09         1.54
   sys    0.00         0.00

3. Pentium 4 2.66GHz, Suse 9.2

          gcc3.3 20030226    gcc-4.0.0
   real   1.35               4.98
   user   1.32               4.96
   sys    0.00               0.00

gcc-4.0.0 performed worse than gcc-3.3.3 or gcc-3.4.3 even for this simple program. The test results using Dirac itself were similar.

I posted a message on the gcc mailing list; here is an excerpt from one of the replies.

---
I took a quick look at it. It appears to be a register allocation issue. The gcc mainline compiled code I looked at uses 3 mmx registers, and ends up putting one variable on the stack, thus needing two extra loads and stores in the inner loop. The gcc-3.3.3 compiled code I looked at put everything in registers, using 7 mmx registers, and no unnecessary loads/stores in the inner loop.
----

--
           Summary: Performance degradation when building code that uses MMX intrinsics with gcc-4.0.0
           Product: gcc
           Version: 4.0.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P2
         Component: rtl-optimization
        AssignedTo: unassigned at gcc dot gnu dot org
        ReportedBy: asuraparaju at gmail dot com
                CC: asuraparaju at gmail dot com,gcc-bugs at gcc dot gnu dot org
 GCC build triplet: i686-pc-linux-gnu
  GCC host triplet: i686-pc-linux-gnu
GCC target triplet: i686-pc-linux-gnu

configured with: ../gcc- 4.0.0/configure --pref

http://gcc.gnu.org/bugzilla/show_bug.cgi?id=21395
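
For readers without the attachment, here is a minimal sketch of the kind of MMX-intrinsics inner loop this report is about. It is not the code from test_mmx_diff4.cpp or from Dirac: the function names ssd_scalar and ssd_mmx, and the choice of a sum of squared differences, are assumptions made purely for illustration. Each __m64 value that is live in the loop wants its own MMX register, so a loop of this shape is where a spill to the stack would show up as extra loads and stores. The MMX path is guarded by __MMX__ so the sketch still builds where MMX is not enabled.

```cpp
#include <cstring>
#if defined(__MMX__)
#include <mmintrin.h>
#endif

// Scalar reference: sum of squared differences of two rows of 16-bit samples.
int ssd_scalar(const short* a, const short* b, int n)
{
    int sum = 0;
    for (int i = 0; i < n; ++i) {
        const int d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}

// Illustrative MMX version (hypothetical, not the Dirac implementation).
// Processes 4 samples per iteration; n is assumed to be a multiple of 4.
// Differences must fit in 16 bits and the total must fit in 32 bits.
int ssd_mmx(const short* a, const short* b, int n)
{
#if defined(__MMX__)
    __m64 acc = _mm_setzero_si64();            // two 32-bit partial sums
    for (int i = 0; i < n; i += 4) {
        __m64 va, vb;
        std::memcpy(&va, a + i, sizeof va);    // load 4 x int16 without alignment assumptions
        std::memcpy(&vb, b + i, sizeof vb);
        const __m64 d = _mm_sub_pi16(va, vb);  // 4 x 16-bit differences
        // _mm_madd_pi16 multiplies lane-wise and adds adjacent pairs,
        // giving two 32-bit sums of squares.
        acc = _mm_add_pi32(acc, _mm_madd_pi16(d, d));
    }
    const int lo = _mm_cvtsi64_si32(acc);                     // low 32-bit lane
    const int hi = _mm_cvtsi64_si32(_mm_srli_si64(acc, 32));  // high 32-bit lane
    _mm_empty();                               // leave MMX state before returning
    return lo + hi;
#else
    return ssd_scalar(a, b, n);                // fallback when MMX is not enabled
#endif
}
```

Building this with the compile line from the report and inspecting the assembly (g++ -S) is one way to see how many MMX registers a given compiler version uses in the loop and whether it spills any value to the stack.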