Undefined constant is crashing streams - g++ bug?
Hello,

I think I have found a bug in G++. Please submit it to the bug tracker if you think it is a bug (I do not want to open an account there); I am not sure about it myself.

While working with search and replace, I accidentally ended up with the following in my source code:

    const char* DUMMY = DUMMY;

It is amazing that this code actually compiles, and no warning is output at all.

Using this mysterious constant "DUMMY" causes odd behavior. For example, if it is written to a stream, the stream becomes "broken" and nothing can be written to it anymore:

    cout << "Hello world" << endl;
    cout << DUMMY;
    cout << "You cannot read this. I am broken..." << endl;

I first suspected that the program had crashed or been terminated, but in fact it is still running. I verified this with a printf() at the end, which showed that the program keeps running and only the 'cout' stream is gone.

Here are two small reproducible examples:

broken.c:

    #include <iostream>
    #include <...>

    int main(void) {
        const char* DUMMY = DUMMY;
        std::cout << DUMMY;
        return 0;
    }

working.c:

    #include <iostream>
    #include <...>

    int main(void) {
        const char* DUMMY = "x";
        std::cout << DUMMY;
        return 0;
    }

The diff of broken.asm and working.asm is:

    --- broken.asm  2012-04-29 08:41:25.0 +0200
    +++ working.asm 2012-04-29 08:41:26.0 +0200
    @@ -1 +1 @@
    -       .file   "broken.c"
    +       .file   "working.c"
    @@ -3,0 +4,3 @@
    +       .section        .rodata
    +.LC0:
    +       .string "x"
    @@ -16,0 +20 @@
    +       movq    $.LC0, -8(%rbp)

A part of working.asm:

    .LC0:
            .string "x"
            .text
    main:
    .LFB957:
            .cfi_startproc
            .cfi_personality 0x3,__gxx_personality_v0
            pushq   %rbp
            .cfi_def_cfa_offset 16
            movq    %rsp, %rbp
            .cfi_offset 6, -16
            .cfi_def_cfa_register 6
            subq    $16, %rsp
            movq    $.LC0, -8(%rbp)    // <- this is missing in broken.asm!
            movq    -8(%rbp), %rax
            movq    %rax, %rsi
            movl    $_ZSt4cout, %edi
            call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
            movl    $0, %eax
            leave
            ret
            .cfi_endproc

A part of broken.asm:

    main:
    .LFB957:
            .cfi_startproc
            .cfi_personality 0x3,__gxx_personality_v0
            pushq   %rbp
            .cfi_def_cfa_offset 16
            movq    %rsp, %rbp
            .cfi_offset 6, -16
            .cfi_def_cfa_register 6
            subq    $16, %rsp
            movq    -8(%rbp), %rax
            movq    %rax, %rsi
            movl    $_ZSt4cout, %edi
            call    _ZStlsISt11char_traitsIcEERSt13basic_ostreamIcT_ES5_PKc
            movl    $0, %eax
            leave
            ret
            .cfi_endproc

My 'g++ -v' output:

    Using built-in specs.
    Target: x86_64-linux-gnu
    Configured with: ../src/configure -v --with-pkgversion='Debian 4.4.5-8' --with-bugurl=file:///usr/share/doc/gcc-4.4/README.Bugs --enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr --program-suffix=-4.4 --enable-shared --enable-multiarch --enable-linker-build-id --with-system-zlib --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --with-gxx-include-dir=/usr/include/c++/4.4 --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc --with-arch-32=i586 --with-tune=generic --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu
    Thread model: posix
    gcc version 4.4.5 (Debian 4.4.5-8)

(Note: This is the latest version I can get. Since it is a production system, I cannot install newer unstable versions, and I do not have a Linux box at home.)

Best regards
Daniel Marschall
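A possible explanation, offered as an assumption and not as a diagnosis of this exact build: DUMMY is initialized from its own indeterminate value, so reading it is undefined behavior, and the stack slot at -8(%rbp) may simply have held a null pointer. As far as I know, libstdc++ responds to a null const char* by setting badbit on the stream, after which every later write is ignored until the state is cleared. The sketch below reproduces that stream state without relying on undefined behavior by calling setstate directly; the helper name write_after_badbit is hypothetical.

```cpp
#include <ios>
#include <sstream>
#include <string>

// Hypothetical helper: shows that a stream with badbit set drops all
// output until clear() is called. badbit is set explicitly here to stand
// in for what writing a null const char* is assumed to trigger.
std::string write_after_badbit(bool recover) {
    std::ostringstream out;
    out << "Hello world\n";
    out.setstate(std::ios_base::badbit);   // the stream is now "broken"
    out << "You cannot read this.\n";      // silently discarded
    if (recover) {
        out.clear();                       // reset the error state
        out << "Back again.\n";
    }
    return out.str();
}
```

On the missing warning: GCC documents a -Winit-self option, used together with -Wuninitialized, for variables initialized with themselves. Whether it fires for this exact case on 4.4.5 is not something this thread confirms.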
Re: G++ could optimize ASM code more
Hello, and thanks for your quick reply!

On 09.05.2012 15:59, Ian Lance Taylor wrote:
> Note that the current GCC release is 4.7.0.

The problem with Debian Squeeze is that I always have to use "medieval" software... ;-) Maybe I should develop the server software on a local box using "unstable" software. On the other hand, if I develop directly on the production machine, I can optimize the program for that machine itself and not for my local box/CPU.

> This cast changes the meaning of the code, so it's not surprising that
> you see different assembler instructions. The first case above will do
> the multiplication in the type "unsigned long long". In the second case
> the "unsigned char" values are zero-extended to int, and the
> multiplication is done in the type "int". Then the "int" result is
> sign-extended to "unsigned long long" for the addition.
>
> In this case it's true that the compiler could convert the code as you
> suggest, based on the knowledge that the int values are always in the
> range 0 to 255.

I understood that the compiler used a "signed" multiplication instead of an unsigned one because char*char needs to be extended. Maybe I am wrong, but couldn't the compiler "know" that the result will at least be unsigned, because unsigned * unsigned = unsigned? Then it could have extended the multiplication to the unsigned long long type of c, or at least to "unsigned int" instead of "signed int".

> However, it's not clear to me that using imulq would be better. My copy
> of the Intel optimization manual suggests that imull has slightly lower
> latency than imulq, so I think that in many cases imull would be
> preferred.

Mh... good point. I do not know much about assembler, so I just thought: the shorter the code, the better.

If imull is faster than imulq, then the question is whether imull+movslq is still faster than a single imulq. Do you know where I can find this information for my CPU (Intel Xeon X3440)? I was searching for a table that shows how many CPU cycles imull, imulq and movslq need, but I have not found one yet.

My Linux is 2.6.32-5-amd64 #1 SMP Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux, and the CPU is "Intel(R) Xeon(R) CPU X3440 @ 2.53GHz". (I hope the "amd64" version of Debian is the correct one, or should our admin have installed the "ia64" variant since it is an Intel CPU?)

Best regards
Daniel Marschall
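The typing rule Ian describes can be checked at compile time. The original source under discussion is not part of this thread excerpt, so the two functions below (mul_add_no_cast and mul_add_cast) are hypothetical stand-ins, and the sketch assumes a C++11 compiler on a platform where int is wider than unsigned char, such as x86-64.

```cpp
#include <type_traits>

// Without a cast: a and b are promoted to int, multiplied as int, and the
// int product is then converted to unsigned long long for the addition.
unsigned long long mul_add_no_cast(unsigned long long acc,
                                   unsigned char a, unsigned char b) {
    static_assert(std::is_same<decltype(a * b), int>::value,
                  "unsigned char * unsigned char has type int");
    return acc + a * b;
}

// With a cast: the whole multiplication is done in unsigned long long.
unsigned long long mul_add_cast(unsigned long long acc,
                                unsigned char a, unsigned char b) {
    static_assert(std::is_same<decltype((unsigned long long)a * b),
                               unsigned long long>::value,
                  "the cast widens the multiplication");
    return acc + (unsigned long long)a * b;
}
```

Both functions return the same value for every input; only the type in which the product is computed differs, which is what leads to imull+movslq in one case and imulq in the other.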
Re: G++ could optimize ASM code more
Hello,

> Look for the Intel Optimization Manual on intel.com. The appendixes
> have latency and throughput information for the instruction set on
> various Intel processors.

Uh-oh, that's hard. I tried to find the information, but I found only part of what I was looking for.

First, I used -masm=intel to get Intel syntax and got:

- for the no-typecast variant (imull):

      imul  ecx, esi    # imull
      movsx rcx, ecx    # movslq

- for the typecast variant (imulq):

      imul  rcx, rsi    # imulq

In the Intel manual I collected the following information from Appendix C, Table C-16a:

                     Latency           Throughput
                     0f_3h   0f_2h     0f_3h   0f_2h
    imul r32         10      14        1       3
    imul imm32       -       14        1       3
    imul             -       15-18     -       5
    mov              1       0.5       0.5     0.5
    movsb/movsw      1       0.5       0.5     0.5

I have 3 problems:

1. I do not know my DisplayName/DisplayFamily (0f_2h or 0f_3h?).
2. The table does not contain "movsx".
3. Should I compare latency or throughput if I want to produce fast code? Or doesn't it matter which value I compare?

I assume that movsx has the same latency as movsw (but I am not sure), and I think that "imul" in the table refers to AT&T's "imulq" (Intel's "imul rcx, rsi"), while "imul r32" in the table refers to AT&T's "imull" (Intel's "imul ecx, esi"). Am I right?

Daniel

On 09.05.2012 20:30, Ian Lance Taylor wrote:
> Daniel Marschall writes:
>
>> I did understand that the compiler used "signed" multiplication
>> instead of an unsigned one because char*char needs to be extended.
>> Maybe I am wrong, but couldn't the compiler "know" that the result
>> will be at least unsigned because unsigned * unsigned = unsigned ?
>
> Well, but the rules of C say that the unsigned char values are
> zero-extended to int, and then they are multiplied using a signed
> multiplication. So the result is not unsigned. The compiler really
> would have to do some sort of type or value based reasoning here to
> determine that an unsigned multiplication would work also.
>
>> Mh... good point. I do not know much about Assembler so I just
>> thought the shorter the code the better.
>
> Sadly, no.
>
>> If imull is faster than imulq, then the question is, if imull+movslq
>> is still faster than a single imulq. Do you know where I can find
>> these informations for my CPU (Intel Xeon X3440)? I was searching for
>> a table which shows how many CPU-ticks the imull, imulq and movslq
>> need, but yet I have not found one. My Linux is 2.6.32-5-amd64 #1 SMP
>> Mon Jan 16 16:22:28 UTC 2012 x86_64 GNU/Linux . And the CPU is
>> "Intel(R) Xeon(R) CPU X3440 @ 2.53GHz". (I hope the "amd64" version of
>> Debian is the correct one, or should our admin have installed the
>> "ia64" variant since it is an Intel CPU?)
>
> Ian
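The "value based reasoning" Ian mentions comes down to one observation: the product of two unsigned char values is at most 255 * 255 = 65025, which fits in an int on this platform, so the signed int product can never be negative and sign extension cannot change it. The sketch below checks this exhaustively; the function name products_agree_for_all_bytes is hypothetical, and the check assumes an 8-bit unsigned char and an int of at least 32 bits.

```cpp
// Hypothetical check: for every pair of unsigned char values, compare the
// product computed in int (what the language rules prescribe) with the
// product computed in unsigned long long (what the cast requests).
bool products_agree_for_all_bytes() {
    for (unsigned a = 0; a <= 255; ++a) {
        for (unsigned b = 0; b <= 255; ++b) {
            unsigned char x = static_cast<unsigned char>(a);
            unsigned char y = static_cast<unsigned char>(b);
            int narrow = x * y;  // promoted to int, at most 255 * 255
            unsigned long long wide = static_cast<unsigned long long>(x) * y;
            if (static_cast<unsigned long long>(narrow) != wide) {
                return false;
            }
        }
    }
    return true;
}
```

This only shows that both forms are interchangeable for these inputs. It says nothing about which instruction sequence is faster on a given CPU.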
Re: G++ could optimize ASM code more
On 09.05.2012 21:48, Marc Glisse wrote:
> On Wed, 9 May 2012, Daniel Marschall wrote:
>
>> 1. I do not know my DisplayName/DisplayFamily (0f_2h or 0f_3h?).
>
> Ask your processor (cpuid). Or your kernel (/proc/cpuinfo on linux).

/proc/cpuinfo says:

    processor   : 0
    vendor_id   : GenuineIntel
    cpu family  : 6
    model       : 30
    model name  : Intel(R) Xeon(R) CPU X3440 @ 2.53GHz
    stepping    : 5
    ...

But I do not know whether this is "0f_2h" or "0f_3h". That is cryptic to me.

>> 3. Should I compare Latency or Throughput if I want to produce fast
>> code? Or doesn't it matter which value I compare?
>
> Both. And you also need to look at the code that is nearby, not just
> this one instruction. In short, don't bother. If you really want to
> know, benchmark both versions.

The nearby code is identical. The typecast only changes these two opcodes. Yes, I should run some benchmarks. It would have to be a long-running benchmark, since the speedup is very fine-grained.

Daniel
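The column labels in Intel's tables are DisplayFamily_DisplayModel values written in hexadecimal, while /proc/cpuinfo prints "cpu family" and "model" in decimal. A minimal sketch of the conversion follows; the helper name intel_signature is hypothetical, and it assumes the kernel already reports the display values, which as far as I know Linux does.

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper: formats the decimal "cpu family" and "model" values
// from /proc/cpuinfo as a hexadecimal DisplayFamily_DisplayModel label.
std::string intel_signature(unsigned family, unsigned model) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%02X_%02XH", family, model);
    return std::string(buf);
}
```

For family 6 and model 30 this yields 06_1EH. The labels 0f_2h and 0f_3h both belong to family 15 (0x0F), so neither of those two columns describes this CPU; the manual's printed labels may omit the leading zero of the model.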
Re: G++ could optimize ASM code more
On 09.05.2012 20:30, Ian Lance Taylor wrote:
> Daniel Marschall writes:
>
>> I did understand that the compiler used "signed" multiplication
>> instead of an unsigned one because char*char needs to be extended.
>> Maybe I am wrong, but couldn't the compiler "know" that the result
>> will be at least unsigned because unsigned * unsigned = unsigned ?
>
> Well, but the rules of C say that the unsigned char values are
> zero-extended to int, and then they are multiplied using a signed
> multiplication. So the result is not unsigned. The compiler really
> would have to do some sort of type or value based reasoning here to
> determine that an unsigned multiplication would work also.

Hello,

I have successfully benchmarked my code. The no-typecast version (imull+movslq) needed 47 seconds for 12 work packages, while the typecast version (imulq) needed only 38 seconds for 12 work packages. That is incredible! Maybe you should still consider preferring imulq over imull+movslq?

I wonder whether GCC has an optimization pass that works on the machine code itself, without knowledge of the underlying C code. For example, it could eliminate unnecessary mov instructions when a register is not used, or choose operations with lower latency. I think such an "assembler-only" optimization could still gain additional performance, since the rules of the underlying programming language (e.g. the expansion to signed int) can be ignored as long as the end result is the same. But I fear that this is a rather hard task and maybe not possible.

Daniel
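For anyone wanting to repeat such a comparison in isolation, a minimal timing sketch follows. It is not the benchmark used above: the loop bodies, the function names (sum_no_cast, sum_cast, time_ms) and the input data are all hypothetical, and std::chrono requires a C++11 compiler, which is newer than the GCC 4.4.5 discussed in this thread. An optimizer is free to transform both loops, so the generated assembly should be inspected before drawing conclusions.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

// Variant without a cast: each product is computed in int.
unsigned long long sum_no_cast(const std::vector<unsigned char>& v) {
    unsigned long long acc = 0;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        acc += v[i] * v[i + 1];
    return acc;
}

// Variant with a cast: each product is computed in unsigned long long.
unsigned long long sum_cast(const std::vector<unsigned char>& v) {
    unsigned long long acc = 0;
    for (std::size_t i = 0; i + 1 < v.size(); ++i)
        acc += (unsigned long long)v[i] * v[i + 1];
    return acc;
}

// Times one call of f in milliseconds and stores its result in *out,
// so that the computation has an observable effect.
template <typename F>
double time_ms(F f, const std::vector<unsigned char>& v,
               unsigned long long* out) {
    auto start = std::chrono::steady_clock::now();
    *out = f(v);
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

A fair comparison would run each variant many times on a large input, compare the two results for equality, and look at the spread of the timings instead of a single run.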