[Bug c++/50182] New: Performance degradation from gcc 4.1 (x86_64)

2011-08-24 Thread oleg.smolsky at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=50182

 Bug #: 50182
   Summary: Performance degradation from gcc 4.1 (x86_64)
Classification: Unclassified
   Product: gcc
   Version: 4.6.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
AssignedTo: unassig...@gcc.gnu.org
ReportedBy: oleg.smol...@gmail.com


G++ 4.6 emits slower code based on the following set of benchmarks:
http://stlab.adobe.com/performance/ 

The discussion thread is here:
http://gcc.gnu.org/ml/gcc/2011-07/threads.html#00506
http://gcc.gnu.org/ml/gcc/2011-08/threads.html#00411

I digested one of the tests down to a single short case (see attachments):
http://gcc.gnu.org/ml/gcc/2011-08/msg00391.html



g++ 4.1 (1.35 sec, 1185M ops/s):

.text:00400FE0 loc_400FE0:
.text:00400FE0 movzx   eax, ds:data8[rdx]
.text:00400FE7 add rdx, 1
.text:00400FEB add eax, 0Ah
.text:00400FEE cmp rdx, 1F40h
.text:00400FF5 lea ecx, [rax+rcx]
.text:00400FF8 jnz short loc_400FE0

g++ 4.6 (2.86 sec, 563M ops/s):

.text:00400D90 loc_400D90:
.text:00400D90 add eax, 0Ah
.text:00400D93 add al, [rdx]
.text:00400D95 add rdx, 1
.text:00400D99 cmp rdx, 503480h
.text:00400DA0 jnz short loc_400D90

P.S. setting the component to C++. Optimizer?
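
For readers without the attachment, the two inner loops above correspond roughly to a kernel of the following shape. This is only a sketch under assumptions: the names add_constant and accumulate_with are hypothetical (they are not taken from the benchmark sources or the attached test case), and the only details taken from the listings are the constant 10 (0Ah) and the 8-bit accumulator. The 4.1 listing compares its index against 1F40h, so each pass covers 8000 elements.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the benchmark's "constant add" operation:
// adds 10 (0Ah in the listings) and truncates back to the element type.
template <typename T>
struct add_constant {
    static T apply(T input) { return static_cast<T>(input + T(10)); }
};

// Sums Op::apply(x) over the buffer, accumulating in the element type T.
// The accumulator therefore wraps at T's width, which is what lets a
// compiler fold the load into a narrow "add al, [rdx]".
// Note: for signed T the narrowing conversion is implementation-defined
// before C++20 (GCC wraps modulo 2^N).
template <typename T, typename Op>
T accumulate_with(const T* first, std::size_t count) {
    T result = 0;
    for (std::size_t n = 0; n < count; ++n) {
        result = static_cast<T>(result + Op::apply(first[n]));
    }
    return result;
}
```

Instantiating accumulate_with with std::int8_t or std::uint8_t gives the 8-bit case shown in the listings; a 16-bit element type gives the analogous 16-bit loop.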


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-24 Thread oleg.smolsky at gmail dot com

--- Comment #1 from Oleg Smolsky 2011-08-24 22:13:26 UTC ---
Created attachment 25097
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25097
The test case

This is the preprocessed source for the test discussed in the mail thread.


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com

--- Comment #5 from Oleg Smolsky 2011-08-25 15:19:57 UTC ---
Created attachment 25103
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=25103
The same test preprocessed with g++ 4.1


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com

--- Comment #6 from Oleg Smolsky 2011-08-25 15:25:49 UTC ---
Oh, the settings and things were discussed in the mail thread... Here is the
digest:

I have compiled and run a set of C++ benchmarks on a CentOS4/64 box using the
following compilers:
 a) g++4.1 that is available for this distro (GCC version 4.1.2 20071124 (Red
Hat 4.1.2-42))
 b) g++4.6 that I built (stock version 4.6.1)

I built the compiler with all the default options (it just has a distinct
installation path):
 ../gcc-%{version}/configure --prefix=/work/tools/gcc46
--enable-languages=c,c++ --with-system-zlib --with-mpfr=/work/tools/mpfr24
--with-gmp=/work/tools/gmp --with-mpc=/work/tools/mpc
LD_LIBRARY_PATH=/work/tools/mpfr/lib24:/work/tools/gmp/lib:/work/tools/mpc/lib

Tests were compiled with -O2 and -O3; I later added -march=native to the 4.6
builds.

The processor is an Intel quad-core something:

processor: 0
vendor_id: GenuineIntel
cpu family: 6
model: 15
model name: Genuine Intel(R) CPU  @ 2.40GHz
stepping: 4
cpu MHz: 2393.943
cache size: 4096 KB
physical id: 0
siblings: 4
core id: 0
cpu cores: 4
fpu: yes
fpu_exception: yes
cpuid level: 10
wp: yes
flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx lm pni monitor
ds_cpl tm2 cx16 xtpr lahf_lm
bogomips: 4793.09
clflush size: 64
cache_alignment: 64
address sizes: 36 bits physical, 48 bits virtual
power management:


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com

--- Comment #9 from Oleg Smolsky 2011-08-25 16:26:05 UTC ---
AFAIK it's a production processor, a couple of years old. From x86info:

Family: 6 Model: 15 Stepping: 4 Type: 0 Brand: 0
CPU Model: Core 2 Duo E6600 Original OEM
Feature flags:
 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflsh
ds acpi mmx fxsr sse sse2 ss ht tm pbe sse3 monitor ds-cpl vmx tm2 ssse3 cx16
xTPR
Extended feature flags:
 SYSCALL xd em64t lahf_lm
Cache info
 L1 Instruction cache: 32KB, 8-way associative. 64 byte line size.
 L1 Data cache: 32KB, 8-way associative. 64 byte line size.
 L3 unified cache: 4MB, 16-way associative. 64 byte line size.
TLB info
 Instruction TLB: 4x 4MB page entries, or 8x 2MB pages entries, 4-way assoc..
 Instruction TLB: 4K pages, 4-way associative, 128 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
 Data TLB: 4MB pages, 4-way associative, 32 entries
 64 byte prefetching.
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 L0 Data TLB: 4MB pages, 4-way set associative, 16 entries
 Data TLB: 4K pages, 4-way associative, 256 entries.
The physical package supports 4 logical processors


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com

--- Comment #10 from Oleg Smolsky 2011-08-25 22:08:49 UTC ---
BTW, the uint16_t test also got slower for the very same reason. Here is the
innermost loop generated by g++ 4.6:

.text:00400DA0 loc_400DA0:
.text:00400DA0 add eax, 0Ah
.text:00400DA3 add ax, [rdx]
.text:00400DA6 add rdx, 2
.text:00400DAA cmp rdx, 5092E0h
.text:00400DB1 jnz short loc_400DA0


[Bug target/50182] Performance degradation from gcc 4.1 (x86_64)

2011-08-25 Thread oleg.smolsky at gmail dot com

--- Comment #11 from Oleg Smolsky 2011-08-26 00:48:02 UTC ---
Also, I have just built the same suite with GCC version 4.7 that came from
ftp://gcc.gnu.org/pub/gcc/snapshots/4.7-20110820/gcc-4.7-20110820.tar.bz2 and
the performance degradation remains:

gcc41:
0 "int8_t constant add"   1.35 sec   1185.19 M 1.00

gcc47:
0 "int8_t constant add"   2.37 sec   675.11 M 1.00

Note: these are the original, unmodified tests, not my digested derivatives.
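
To put the two result lines side by side, the arithmetic can be sketched as below. The helper names are hypothetical, and the total of 1.6e9 operations per test is an assumption inferred from the reported figures (1.35 sec at 1185.19 M ops/s), not a value stated in the thread.

```cpp
// Throughput in millions of operations per second.
double mops_per_sec(double total_ops, double seconds) {
    return total_ops / seconds / 1e6;
}

// Slowdown factor of the slower run relative to the faster one.
double slowdown(double fast_sec, double slow_sec) {
    return slow_sec / fast_sec;
}
```

With the timings quoted above, slowdown(1.35, 2.37) is about 1.76, so the 4.7 snapshot takes roughly three quarters longer than 4.1 on this test.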