Undefined constant is crashing streams - g++ bug?

2012-04-28 Thread Daniel Marschall

Hello,

I think I have found a bug in g++. If you agree that it is a bug, please 
submit it to the bug tracker (I do not want to open an account there). I 
am not sure about it myself.


While working with search and replace, I accidentally ended up with the 
following in my source code:


const char* DUMMY = DUMMY;

It is amazing that this code actually compiles, and that no warning is 
output at all.


Using this mysterious constant "DUMMY" causes odd behavior. For example, 
if it is written to a stream, the stream becomes "broken" and nothing 
more can be written to it:


cout << "Hello world" << endl;
cout << DUMMY;
cout << "You cannot read this. I am broken..." << endl;

I first suspected that the program had crashed or been terminated, but 
in fact it is still running. I verified this with a printf() at the end, 
which showed that the program keeps running and only the 'cout' stream 
is gone.


Here are two small reproducible examples:

1. broken.c:


#include <iostream>
int main(void) {
    const char* DUMMY = DUMMY;  // initialized from itself
    std::cout << DUMMY;
    return 0;
}

2. working.c:


#include <iostream>
int main(void) {
    const char* DUMMY = "x";
    std::cout << DUMMY;
    return 0;
}

The diff between broken.asm and working.asm is:
===============================================

--- broken.asm  2012-04-29 08:41:25.0 +0200
+++ working.asm 2012-04-29 08:41:26.0 +0200
@@ -1 +1 @@
-   .file   "broken.c"
+   .file   "working.c"
@@ -3,0 +4,3 @@
+   .section    .rodata
+.LC0:
+   .string "x"
@@ -16,0 +20 @@
+   movq    $.LC0, -8(%rbp)

A part of working.asm:
======================

.LC0:
        .string "x"
        .text
main:
.LFB957:
        .cfi_startproc
        .cfi_personality 0x3,__gxx_personality_v0
        pushq   %rbp
        .cfi_def_cfa_offset 16
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    $.LC0, -8(%rbp)   # <- this is missing in broken.asm!
        movq    -8(%rbp), %rax
        movq    %rax, %rsi
        movl    $_ZSt4cout, %edi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
        movl    $0, %eax
        leave
        ret
        .cfi_endproc

A part of broken.asm:
=====================

main:
.LFB957:
        .cfi_startproc
        .cfi_personality 0x3,__gxx_personality_v0
        pushq   %rbp
        .cfi_def_cfa_offset 16
        movq    %rsp, %rbp
        .cfi_offset 6, -16
        .cfi_def_cfa_register 6
        subq    $16, %rsp
        movq    -8(%rbp), %rax
        movq    %rax, %rsi
        movl    $_ZSt4cout, %edi
        call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
        movl    $0, %eax
        leave
        ret
        .cfi_endproc

My 'g++ -v' output:



Using built-in specs.
Target: x86_64-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Debian 
4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs 
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr 
--program-suffix=-4.4 --enable-shared --enable-multiarch 
--enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib 
--without-included-gettext --enable-threads=posix 
--with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib 
--enable-nls --enable-clocale=gnu --enable-libstdcxx-debug 
--enable-objc-gc --with-arch-32=i586 --with-tune=generic 
--enable-checking=release --build=x86_64-linux-gnu 
--host=x86_64-linux-gnu --target=x86_64-linux-gnu

Thread model: posix
gcc version 4.4.5 (Debian 4.4.5-8)


(Note: It is the latest version I can get. Since it is a production 
system I cannot install newer unstable versions and I do not have a 
Linux box at home.)


Best regards
Daniel Marschall



Re: G++ could optimize ASM code more

2012-05-09 Thread Daniel Marschall

Hello and thanks for your quick reply!

On 09.05.2012 15:59, Ian Lance Taylor wrote:

> Note that the current GCC release is 4.7.0.


The problem with Debian Squeeze is always that I have to use "medieval" 
software... ;-) Maybe I should develop the server software on a local 
box using "unstable" software. On the other hand, if I develop directly 
on the production machine, I can optimize the program for that machine 
itself rather than for my local box/CPU.




> This cast changes the meaning of the code, so it's not surprising that
> you see different assembler instructions.  The first case above will do
> the multiplication in the type "unsigned long long".  In the second case
> the "unsigned char" values are zero-extended to int, and the
> multiplication is done in the type "int".  Then the "int" result is
> sign-extended to "unsigned long long" for the addition.
>
> In this case it's true that the compiler could convert the code as you
> suggest, based on the knowledge that the int values are always in the
> range 0 to 255.



I understand that the compiler used a "signed" multiplication instead of 
an unsigned one because the char*char operands need to be extended.


Maybe I am wrong, but couldn't the compiler "know" that the result will 
be unsigned, because unsigned * unsigned = unsigned?


So it could have widened the multiplication to the unsigned long long 
data type of c, or at least to "unsigned int" instead of "signed int"?



> However, it's not clear to me that using imulq would be
> better.  My copy of the Intel optimization manual suggests that imull
> has slightly lower latency than imulq, so I think that in many cases
> imull would be preferred.


Mh... good point. I do not know much about assembler, so I just thought 
the shorter the code the better. If imull is faster than imulq, then the 
question is whether imull+movslq is still faster than a single imulq. Do 
you know where I can find this information for my CPU (Intel Xeon 
X3440)? I was searching for a table that shows how many CPU cycles 
imull, imulq and movslq need, but I have not found one yet.


My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 
GNU/Linux .


And the CPU is "Intel(R) Xeon(R) CPU X3440  @ 2.53GHz". (I hope the 
"amd64" version of Debian is the correct one, or should our admin have 
installed the "ia64" variant since it is an Intel CPU?)


Best regards
Daniel Marschall



Re: G++ could optimize ASM code more

2012-05-09 Thread Daniel Marschall

Hello,


> Look for the Intel Optimization Manual on intel.com.  The appendixes
> have latency and throughput information for the instruction set on
> various Intel processors.


Uh-oh, that's hard. I tried to find the information, but I only found 
part of what I was looking for.


First, I used -masm=intel to get Intel syntax, and got:

- for the no-typecast variant (imull):

        imul    ecx, esi   # imull
        movsx   rcx, ecx   # movslq

- for the typecast variant (imulq):

        imul    rcx, rsi   # imulq

In the Intel manual I collected the following information from Appendix 
C, Table C-16a:


Instruction    Latency         Throughput
               0f_3h   0f_2h   0f_3h   0f_2h
imul r32       10      14      1       3
imul imm32     -       14      1       3
imul           -       15-18   -       5
mov            1       0.5     0.5     0.5
movsb/movsw    1       0.5     0.5     0.5


I have 3 problems:
1. I do not know my DisplayName/DisplayFamily (0f_2h or 0f_3h?).
2. The table does not contain "movsx"
3. Should I compare Latency or Throughput if I want to produce fast 
code? Or doesn't it matter which value I compare?


I assume that movsx has the same latency as movsw (but I am not sure), 
and I think that "imul" in the table refers to AT&T's "imulq", i.e. 
Intel's "imul rcx, rsi", while "imul r32" in the table refers to AT&T's 
"imull", i.e. Intel's "imul ecx, esi". Am I right?


Daniel

On 09.05.2012 20:30, Ian Lance Taylor wrote:

> Daniel Marschall writes:
>
> > I did understand that the compiler used "signed" multiplication
> > instead of an unsigned one because char*char needs to be extended.
> >
> > Maybe I am wrong, but couldn't the compiler "know" that the result
> > will be at least unsigned because unsigned * unsigned = unsigned ?
>
> Well, but the rules of C say that the unsigned char values are
> zero-extended to int, and then they are multiplied using a signed
> multiplication.  So the result is not unsigned.  The compiler really
> would have to do some sort of type or value based reasoning here to
> determine that an unsigned multiplication would work also.
>
> > Mh... good point. I do not know much about Assembler so I just thought
> > the shorter the code the better.
>
> Sadly, no.
>
> > If imull is faster than imulq, then
> > the question is, if imull+movslq is still faster than a single
> > imulq. Do you know where I can find these informations for my CPU
> > (Intel Xeon X3440)? I was searching for a table which shows how many
> > CPU-ticks the imull, imulq and movslq need, but yet I have not found
> > one.
> >
> > My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64
> > GNU/Linux .
> >
> > And the CPU is "Intel(R) Xeon(R) CPU X3440  @ 2.53GHz". (I hope the
> > "amd64" version of Debian is the correct one, or should our admin have
> > installed the "ia64" variant since it is an Intel CPU?)
>
> Ian




Re: G++ could optimize ASM code more

2012-05-09 Thread Daniel Marschall

On 09.05.2012 21:48, Marc Glisse wrote:

> On Wed, 9 May 2012, Daniel Marschall wrote:
>
> > 1. I do not know my DisplayName/DisplayFamily (0f_2h or 0f_3h?).
>
> Ask your processor (cpuid). Or your kernel (/proc/cpuinfo on linux).


/proc/cpuinfo says:

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 30
model name  : Intel(R) Xeon(R) CPU   X3440  @ 2.53GHz
stepping: 5
...

But I do not know whether this is "0f_2h" or "0f_3h". That is cryptic to 
me.




> > 3. Should I compare Latency or Throughput if I want to produce fast
> > code? Or doesn't it matter which value I compare?
>
> Both. And you also need to look at the code that is nearby, not just
> this one instruction. In short, don't bother. If you really want to
> know, benchmark both versions.


The nearby code is identical. The typecast only changes these two 
opcodes. Yes, I should do some benchmarking. It would have to be a 
long-running benchmark, since the difference is very small.


Daniel







Re: G++ could optimize ASM code more

2012-05-09 Thread Daniel Marschall


On 09.05.2012 20:30, Ian Lance Taylor wrote:

> Daniel Marschall writes:
>
> > I did understand that the compiler used "signed" multiplication
> > instead of an unsigned one because char*char needs to be extended.
> >
> > Maybe I am wrong, but couldn't the compiler "know" that the result
> > will be at least unsigned because unsigned * unsigned = unsigned ?
>
> Well, but the rules of C say that the unsigned char values are
> zero-extended to int, and then they are multiplied using a signed
> multiplication.  So the result is not unsigned.  The compiler really
> would have to do some sort of type or value based reasoning here to
> determine that an unsigned multiplication would work also.


Hello,

I was able to benchmark my code successfully. I found that the 
no-typecast version (imull+movslq) needed 47 secs for 12 work packages, 
while the typecast version (imulq) needed only 38 secs for 12 work 
packages. That is incredible!


Maybe you should still consider preferring imulq over imull+movslq?


I wonder if GCC has an optimization pass that works on the machine code 
itself, without knowledge of the underlying C code. For example, it 
could eliminate unnecessary mov instructions if a register is not used, 
or choose operations that have lower latency. I think such an 
"assembler-only" optimization could still gain additional performance, 
since the rules of the underlying programming language (e.g. the 
promotion to signed int) can be ignored if the end result is the same. 
But I fear that this is a rather hard task and maybe not possible.


Daniel