[Bug c++/50182] New: Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

             Bug #: 50182
           Summary: Performance degradation from gcc 4.1 (x86_64)
    Classification: Unclassified
           Product: gcc
           Version: 4.6.0
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: c++
        AssignedTo: unassig...@gcc.gnu.org
        ReportedBy: oleg.smol...@gmail.com

G++ 4.6 emits slower code based on the following set of benchmarks:
    http://stlab.adobe.com/performance/

The discussion thread is here:
    http://gcc.gnu.org/ml/gcc/2011-07/threads.html#00506
    http://gcc.gnu.org/ml/gcc/2011-08/threads.html#00411

I digested one of the tests down to a single short case (see attachments):
    http://gcc.gnu.org/ml/gcc/2011-08/msg00391.html

g++ 4.1 (1.35 sec, 1185M ops/s):
    .text:00400FE0 loc_400FE0:
    .text:00400FE0     movzx   eax, ds:data8[rdx]
    .text:00400FE7     add     rdx, 1
    .text:00400FEB     add     eax, 0Ah
    .text:00400FEE     cmp     rdx, 1F40h
    .text:00400FF5     lea     ecx, [rax+rcx]
    .text:00400FF8     jnz     short loc_400FE0

g++ 4.6 (2.86 sec, 563M ops/s):
    .text:00400D90 loc_400D90:
    .text:00400D90     add     eax, 0Ah
    .text:00400D93     add     al, [rdx]
    .text:00400D95     add     rdx, 1
    .text:00400D99     cmp     rdx, 503480h
    .text:00400DA0     jnz     short loc_400D90

P.S. setting the component to C++. Optimizer?
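For readers without the attachment, the kernel behind both listings can be approximated in C++. This is only a sketch, not the attached test case: the names `constant_add_sum` and `run_constant_add` are hypothetical, the trip count of 8000 is read off the `cmp rdx, 1F40h` in the 4.1 listing, and keeping the accumulator in the element type is an assumption based on the narrow `add al, [rdx]` in the 4.6 listing.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the benchmark kernel: add a constant (10, the
// 0Ah in both listings) to every element and accumulate in the element type.
template <typename T>
T constant_add_sum(const T* first, const T* last) {
    assert(first != nullptr && last != nullptr && first <= last);
    T result = 0;
    for (const T* p = first; p != last; ++p) {
        // Narrow types are promoted to int for the addition; the cast
        // narrows the running total back to T on every iteration.
        result = static_cast<T>(result + *p + 10);
    }
    return result;
}

// Convenience driver (also hypothetical): fill `count` elements with `value`
// and run the kernel once over them.
template <typename T>
T run_constant_add(std::size_t count, T value) {
    assert(count > 0);
    std::vector<T> data(count, value);
    return constant_add_sum(data.data(), data.data() + data.size());
}
```

Calling `run_constant_add<std::uint8_t>(8000, 1)` sums 8000 copies of 1 + 10 and wraps modulo 256, giving 192. Unsigned element types are used in this example because the narrowing cast is well defined for them in every C++ standard; the benchmark itself is named for `int8_t`.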
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #1 from Oleg Smolsky 2011-08-24 22:13:26 UTC ---
Created attachment 25097
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25097
The test case

This is the preprocessed source for the test discussed in the mail thread.
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #5 from Oleg Smolsky 2011-08-25 15:19:57 UTC ---
Created attachment 25103
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25103
The same test preprocessed with g++ 4.1
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #6 from Oleg Smolsky 2011-08-25 15:25:49 UTC ---
Oh, the settings and things were discussed in the mail thread... Here is the digest:

I have compiled and run a set of C++ benchmarks on a CentOS4/64 box using the following compilers:
  a) g++4.1 that is available for this distro (GCC version 4.1.2 20071124 (Red Hat 4.1.2-42))
  b) g++4.6 that I built (stock version 4.6.1)

I built the compiler with all the default options (it just has a distinct installation path):
  ../gcc-%{version}/configure --prefix=/work/tools/gcc46 --enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24 --with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc
  LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib

Tests were compiled with -O2 and -O3; I later added -march=native to the 4.6 builds.

The processor is Intel quad core something:
  processor       : 0
  vendor_id       : GenuineIntel
  cpu family      : 6
  model           : 15
  model name      : Genuine Intel(R) CPU @ 2.40GHz
  stepping        : 4
  cpu MHz         : 2393.943
  cache size      : 4096 KB
  physical id     : 0
  siblings        : 4
  core id         : 0
  cpu cores       : 4
  fpu             : yes
  fpu_exception   : yes
  cpuid level     : 10
  wp              : yes
  flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm pni monitor ds_cpl tm2 cx16 xtpr lahf_lm
  bogomips        : 4793.09
  clflush size    : 64
  cache_alignment : 64
  address sizes   : 36 bits physical, 48 bits virtual
  power management:
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #9 from Oleg Smolsky 2011-08-25 16:26:05 UTC ---
AFAIK it's a production processor, a couple of years old. From x86info:

Family: 6 Model: 15 Stepping: 4 Type: 0 Brand: 0
CPU Model: Core 2 Duo E6600
Original OEM
Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 monitor ds-cpl vmx tm2 ssse3 cx16 xTPR
Extended feature flags:
 SYSCALL xd em64t lahf_lm
Cache info
 L1 Instruction cache: 32KB, 8-way associative. 64 byte line size.
 L1 Data cache: 32KB, 8-way associative. 64 byte line size.
 L3 unified cache: 4MB, 16-way associative. 64 byte line size.
TLB info
 Instruction TLB: 4x 4MB page entries, or 8x 2MB pages entries, 4-way assoc..
 Instruction TLB: 4K pages, 4-way associative, 128 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 64 byte prefetching.
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
The physical package supports 4 logical processors
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #10 from Oleg Smolsky 2011-08-25 22:08:49 UTC ---
BTW, the uint16_t test also got slower for the very same reason. Here is the innermost loop generated by g++4.6:

    .text:00400DA0 loc_400DA0:
    .text:00400DA0     add     eax, 0Ah
    .text:00400DA3     add     ax, [rdx]
    .text:00400DA6     add     rdx, 2
    .text:00400DAA     cmp     rdx, 5092E0h
    .text:00400DB1     jnz     short loc_400DA0
[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

--- Comment #11 from Oleg Smolsky 2011-08-26 00:48:02 UTC ---
Also, I have just built the same suite with GCC version 4.7 that came from
ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20110820/gcc-4.7-20110820.tar.bz2
and the performance degradation remains:

gcc41: 0 "int8_t constant add"  1.35 sec  1185.19 M  1.00
gcc47: 0 "int8_t constant add"  2.37 sec   675.11 M  1.00

Note that these are the original unmodified tests, not my digested derivatives.
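As a quick sanity check on the two result lines, the slowdown and the implied amount of work can be computed from the reported figures. The helper names below are hypothetical, and reading the two numeric columns as seconds and millions of operations per second is an assumption based on the `sec` and `M` suffixes.

```cpp
#include <cassert>

// Ratio of the candidate compiler's run time to the baseline's.
// Values above 1.0 mean the candidate build is slower.
double slowdown(double baseline_sec, double candidate_sec) {
    assert(baseline_sec > 0.0);
    return candidate_sec / baseline_sec;
}

// Total work in millions of operations implied by a (time, rate) pair.
double total_mops(double seconds, double mops_per_sec) {
    return seconds * mops_per_sec;
}
```

With the numbers above, `slowdown(1.35, 2.37)` comes out at roughly 1.76, and `total_mops` gives about 1600 for both rows, which is consistent with the two builds running the same workload.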