Re: More C type errors by default for GCC 14
> Jonathan Wakely writes:
>
>> Wrong. I wouldn't bother replying to you again in this thread, but I feel that as a gcc maintainer I should confirm that Eli S. is right here; and nobody else I know agrees with your definition of extension as "every non-standard aspect of the compiler's behaviour, whether intentional or accidental". That's just silly.
>
> GCC's support for implicit int is clearly intentional. I never claimed that accidental GNU CC behavior was part of GNU C.

You might not have explicitly stated that, but you have made that argument in this thread. You have asserted that the compiler's behavior, and not its documentation, determines what should be considered a language extension. That assertion, when taken to its natural conclusion, shows support for the idea that "accidental GNU CC behavior" should be considered a language extension, and by becoming a language extension it would be part of GNU C.

If the behavior, and not the documentation, determines what is and is not an extension unless it's "accidental behavior", then how is anyone to know what is or is not a GNU C extension?
Re: Will GCC eventually support SSE2 or SSE4.1?
On 5/26/23 02:46, Stefan Kanthak wrote:
> Hi,
>
> compile the following function on a system with Core2 processor (released January 2008) for the 32-bit execution environment:
>
> --- demo.c ---
> int ispowerof2(unsigned long long argument)
> {
>     return (argument & argument - 1) == 0;
> }
> --- EOF ---
>
> GCC 13.3: gcc -m32 -O3 demo.c
>
> NOTE: -mtune=native is the default!
>
> # https://godbolt.org/z/b43cjGdY9
> ispowerof2(unsigned long long):
>     movq    xmm1, [esp+4]
>     pcmpeqd xmm0, xmm0
>     paddq   xmm0, xmm1
>     pand    xmm0, xmm1
>     movd    edx, xmm0       #   pxor     xmm1, xmm1
>     psrlq   xmm0, 32        #   pcmpeqb  xmm0, xmm1
>     movd    eax, xmm0       #   pmovmskb eax, xmm0
>     or      edx, eax        #   cmp      al, 255
>     sete    al              #   sete     al
>     movzx   eax, al         #
>     ret
>
> 11 instructions in 40 bytes # 10 instructions in 36 bytes

You cannot delete the 'movzx eax, al' instruction. The expression "(argument & argument - 1) == 0" must evaluate to a 0 or a 1. The movzx is required to ensure that the upper 24 bits of the eax register are properly zeroed.

> OOPS: why does GCC (ab)use the SSE2 alias "Willamette New Instruction Set" here instead of the native SSE4.1 alias "Penryn New Instruction Set" of the Core2 (and all later processors)?
>
> OUCH: why does it FAIL to REALLY use SSE2, as shown in the comments on the right side?

After correcting for the above error, your solution is the same size as the solution gcc generated. Therefore, the only remaining question would be "Is your solution faster than the code gcc produced?" If you claim it is, I'd like to see evidence supporting that claim.
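The point about the missing movzx can be checked with a small register model in plain C. This is only an illustrative sketch, not compiler output: the function names are invented for this example, and it assumes the usual semantics of the instructions involved (movq zeroes the upper half of the xmm register, pmovmskb yields one mask bit per byte, and sete al writes only the low 8 bits of eax).

```c
#include <assert.h>
#include <stdint.h>

/* Bit i is set when byte i of t is zero. Models "pcmpeqb xmm0, zero"
 * followed by pmovmskb, restricted to the low 8 bytes of xmm0. */
static uint32_t zero_byte_mask(uint64_t t)
{
    uint32_t mask = 0;
    for (int i = 0; i < 8; i++)
        if (((t >> (8 * i)) & 0xFFu) == 0)
            mask |= 1u << i;
    return mask;
}

/* Value left in eax by the hand-written sequence WITHOUT the final movzx. */
uint32_t eax_without_movzx(uint64_t argument)
{
    uint64_t t = argument & (argument - 1);      /* paddq + pand            */
    /* The upper 8 bytes of xmm0 are zero, so mask bits 8..15 are always set. */
    uint32_t eax = 0xFF00u | zero_byte_mask(t);  /* pcmpeqb + pmovmskb      */
    uint32_t al = (eax & 0xFFu) == 0xFFu;        /* cmp al, 255 ; sete al   */
    return (eax & 0xFFFFFF00u) | al;             /* only the low byte changes */
}

/* Same sequence followed by "movzx eax, al". */
uint32_t eax_with_movzx(uint64_t argument)
{
    return eax_without_movzx(argument) & 0xFFu;
}

/* Hypothetical self-check showing both outcomes. */
void check_movzx_model(void)
{
    assert(eax_without_movzx(8) == 0xFF01u);  /* power of two, but not 1 */
    assert(eax_without_movzx(6) == 0xFF00u);  /* not a power of two, but not 0 */
    assert(eax_with_movzx(8) == 1u);
    assert(eax_with_movzx(6) == 0u);
}
```

Under these assumptions the sequence without movzx never leaves exactly 0 or 1 in eax, which is why the instruction cannot simply be dropped.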
> Now add the -mtune=core2 option to EXPLICITLY enable the NATIVE SSE4.1 alias "Penryn New Instruction Set" of the Core2 processor:
>
> GCC 13.3: gcc -m32 -mtune=core2 -O3 demo.c
>
> # https://godbolt.org/z/svhEoYT11
> ispowerof2(unsigned long long):
>                             #   xor      eax, eax
>     movq    xmm1, [esp+4]   #   movq     xmm1, [esp+4]
>     pcmpeqd xmm0, xmm0      #   pcmpeqq  xmm0, xmm0
>     paddq   xmm0, xmm1      #   paddq    xmm0, xmm1
>     pand    xmm0, xmm1      #   ptest    xmm0, xmm1
>     movd    edx, xmm0       #
>     psrlq   xmm0, 32        #
>     movd    eax, xmm0       #
>     or      edx, eax        #
>     sete    al              #   sete     al
>     movzx   eax, al         #
>     ret                     #   ret
>
> 11 instructions in 40 bytes # 7 instructions in 26 bytes
>
> OUCH: GCC FAILS to use SSE4.1 as shown in the comments on the right side.

As pointed out elsewhere in this thread, you used the wrong flags. With the proper flags, I get

    % gcc -march=x86-64 -msse4.1 -m32 -O3 -c ispowerof2.c && objdump -d ispowerof2.o

    ispowerof2.o:     file format elf32-i386

    Disassembly of section .text:

    :
       0:   f3 0f 7e 4c 24 04       movq   0x4(%esp),%xmm1
       6:   66 0f 76 c0             pcmpeqd %xmm0,%xmm0
       a:   31 c0                   xor    %eax,%eax
       c:   66 0f d4 c1             paddq  %xmm1,%xmm0
      10:   66 0f db c1             pand   %xmm1,%xmm0
      14:   66 0f 6c c0             punpcklqdq %xmm0,%xmm0
      18:   66 0f 38 17 c0          ptest  %xmm0,%xmm0
      1d:   0f 94 c0                sete   %al
      20:   c3                      ret

so with the SSE4.1 instruction set enabled the output is 33 bytes long (the final ret sits at offset 0x20).

> Last compile with -mtune=i386 for the i386 processor:
>
> GCC 13.3: gcc -m32 -mtune=i386 -O3 demo.c
>
> # https://godbolt.org/z/e76W6dsMj
> ispowerof2(unsigned long long):
>     push    ebx             #
>     mov     ecx, [esp+8]    #   mov      eax, [esp+4]
>     mov     ebx, [esp+12]   #   mov      edx, [esp+8]
>     mov     eax, ecx        #
>     mov     edx, ebx        #
>     add     eax, -1         #   add      eax, -1
>     adc     edx, -1         #   adc      edx, -1
>     and     eax, ecx        #   and      eax, [esp+4]
>     and     edx, ebx        #   and      edx, [esp+8]
>     or      eax, edx        #   or       eax, edx
>     sete    al              #   neg      eax
>     movzx   eax, al         #   sbb      eax, eax
>     pop     ebx             #   inc      eax
>     ret                     #   ret
>
> 14 instructions in 33 bytes # 11 instructions in 32 bytes
>
> OUCH: why does GCC abuse EBX (and ECX too) and performs a superfluous memory write?
At -O1 gcc produces:

    % gcc -march=x86-64 -mtune=i386 -m32 -O -c ispowerof2.c && objdump -Mintel -d ispowerof2.o

    ispowerof2.o:     file format elf32-i386

    Disassembly of section .text:

    :
       0:   8b 44 24 04             mov    eax,DWORD PTR [esp+0x4]
       4:   8b 54 24 08             mov    edx,DWORD PTR [esp+0x8]
       8:   83 c0 ff                add    eax,0x
       b:   83 d2 ff                adc    edx,0x
       e:   23 44 24 04             and    eax,DWORD PTR [esp+0x4]
      12:
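The flag mix-up in this message comes down to the fact that -mtune only selects the processor to tune for, while -march= or -msse4.1 actually enables instructions. One way to see which instruction sets a given command line enabled is to test GCC's predefined macros, as in the sketch below. The function names are invented for this example; `__SSE2__` and `__SSE4_1__` are the macros GCC predefines when the corresponding instruction sets are enabled.

```c
#include <assert.h>

/* Returns 1 when this translation unit was compiled with SSE2 enabled. */
int sse2_enabled_at_compile_time(void)
{
#ifdef __SSE2__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 when this translation unit was compiled with SSE4.1 enabled,
 * for example via -msse4.1 or a -march= value that includes it. */
int sse41_enabled_at_compile_time(void)
{
#ifdef __SSE4_1__
    return 1;
#else
    return 0;
#endif
}

/* Hypothetical self-check: enabling SSE4.1 also enables SSE2. */
void check_isa_flags(void)
{
    assert(!sse41_enabled_at_compile_time() || sse2_enabled_at_compile_time());
}
```

Compiling this once with only -mtune=core2 and once with -msse4.1 and comparing the two return values is a quick way to confirm whether the compiler was allowed to emit SSE4.1 instructions at all.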
Re: Will GCC eventually support SSE2 or SSE4.1?
On 5/26/23 08:42, Stefan Kanthak wrote:
> I could have added PROPERLY, because that's where it CLEARLY fails, as shown by the generated unoptimised code.

From what I've seen so far, I find your arguments unconvincing. In this thread alone, you've proven that you don't know how to properly control gcc via its command-line flags, and that you don't know how to properly generate assembly code for your own C example (properly in this case meaning to exhibit the behavior the ISO C standard requires), which makes it hard for me to accept your claims at face value (your C example is also logically incorrect, but that's not important to this discussion).

That said, assuming that your "optimized assembly" examples (with the exception of the first) are correct, all you've done is show that your versions are slightly smaller in both instruction count and size, and declare your examples "proper". The optimization flag -O3 (like most of the -On flags) optimizes for speed over all else, and it has been shown that the faster code isn't necessarily the code with the fewest instructions or the smallest size (see the RISC vs. CISC debate).

To accept that your suggestions are the proper ways to generate code using SSE4.1 instructions at -O3, I insist on data that clearly demonstrates that your suggestions are at least as performant as what GCC currently produces.
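The message does not say which logical error in the C example it has in mind. One likely candidate, offered here only as an assumption, is that the expression also reports 0 as a power of two. The sketch below shows that behaviour next to a stricter variant; both function names are invented for this example.

```c
#include <assert.h>

/* The expression from the thread, with explicit parentheses added.
 * For argument == 0, argument - 1 wraps to all ones and the AND is 0,
 * so the function returns 1. */
int ispowerof2_as_posted(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* A stricter variant that rejects 0. */
int ispowerof2_strict(unsigned long long argument)
{
    return argument != 0 && (argument & (argument - 1)) == 0;
}

/* Hypothetical self-check highlighting the single differing input. */
void check_zero_case(void)
{
    assert(ispowerof2_as_posted(0) == 1);
    assert(ispowerof2_strict(0) == 0);
    assert(ispowerof2_as_posted(64) == ispowerof2_strict(64));
}
```

The two functions agree on every nonzero input, so whether this matters depends on whether callers can ever pass 0.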
Re: Another epic optimiser failure
On 5/27/23 17:04, Stefan Kanthak wrote:
> --- .c ---
> int ispowerof2(unsigned long long argument)
> {
>     return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -march=alderlake -O3
>             gcc -m32 -march=sapphirerapids -O3
>             gcc -m32 -mpopcnt -mtune=sapphirerapids -O3
>
> https://gcc.godbolt.org/z/cToYrrYPq
> ispowerof2(unsigned long long):
>     xor     eax, eax        # superfluous
>     xor     edx, edx        # superfluous
>     popcnt  eax, [esp+4]
>     popcnt  edx, [esp+8]
>     add     eax, edx
>     cmp     eax, 1          ->    dec eax
>     sete    al
>     movzx   eax, al         # superfluous
>     ret
>
> 9 instructions in 28 bytes  # 6 instructions in 20 bytes

I agree this can be done using 6 instructions, but you cannot do it using the dec instruction. If you use the dec instruction, "movzx eax, al" becomes a required instruction (consider the case when the input is 0), resulting in 7 instructions and 22 bytes.
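The input-0 case mentioned above can be traced with a small C model of the tail of the sequence. This is an illustrative sketch only: the function names are invented, the parameter stands for the summed popcount already in eax (0 to 64), and the model assumes that dec eax wraps on underflow and that sete al writes only the low 8 bits of eax.

```c
#include <assert.h>
#include <stdint.h>

/* Tail of the suggested variant: dec eax ; sete al (no movzx). */
uint32_t eax_after_dec_sete(uint32_t bit_count)
{
    uint32_t eax = bit_count - 1;        /* dec eax: 0 wraps to 0xFFFFFFFF */
    uint32_t zf = (eax == 0);            /* zero flag produced by dec      */
    return (eax & 0xFFFFFF00u) | zf;     /* sete al keeps bits 8..31       */
}

/* Same tail with the movzx kept. */
uint32_t eax_after_dec_sete_movzx(uint32_t bit_count)
{
    return eax_after_dec_sete(bit_count) & 0xFFu;   /* movzx eax, al */
}

/* Hypothetical self-check over the interesting bit counts. */
void check_dec_model(void)
{
    assert(eax_after_dec_sete(1) == 1u);            /* exactly one bit set */
    assert(eax_after_dec_sete(64) == 0u);           /* all bits set        */
    assert(eax_after_dec_sete(0) == 0xFFFFFF00u);   /* input 0 goes wrong  */
    assert(eax_after_dec_sete_movzx(0) == 0u);
}
```

In this model every bit count from 1 to 64 leaves the upper bits clear after the decrement, so input 0 is the only case that needs the movzx.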
Re: Who cares about performance (or Intel's CPU errata)?
On 5/27/23 18:52, Stefan Kanthak wrote:
> "Andrew Pinski" wrote:
>
>> On Sat, May 27, 2023 at 2:25 PM Stefan Kanthak wrote:
>>> Just to show how SLOPPY, INCONSEQUENTIAL and INCOMPETENT GCC's developers are:
>>>
>>> --- dontcare.c ---
>>> int ispowerof2(unsigned __int128 argument)
>>> {
>>>     return __builtin_popcountll(argument) + __builtin_popcountll(argument >> 64) == 1;
>>> }
>>> --- EOF ---
>>>
>>> GCC 13.3    gcc -march=haswell -O3
>>>
>>> https://gcc.godbolt.org/z/PPzYsPzMc
>>> ispowerof2(unsigned __int128):
>>>     popcnt  rdi, rdi
>>>     popcnt  rsi, rsi
>>>     add     esi, edi
>>>     xor     eax, eax
>>>     cmp     esi, 1
>>>     sete    al
>>>     ret
>>>
>>> OOPS: what about Intel's CPU errata regarding the false dependency on POPCNTs output?
>>
>> Because the popcount is going to the same register, there is no false dependency. The false dependency errata only applies if the result of the popcnt is going to a different register: the processor thinks it depends on the result in that register from a previous instruction, but it does not (which is why it is called a false dependency). In this case it actually does depend on the previous result, since the output is the same as the input.
>
> OUCH, my fault; sorry for the confusion and the wrong accusation.
>
> Nevertheless GCC fails to optimise code properly:
>
> --- .c ---
> int ispowerof2(unsigned long long argument)
> {
>     return __builtin_popcountll(argument) == 1;
> }
> --- EOF ---
>
> GCC 13.3    gcc -m32 -mpopcnt -O3
>
> https://godbolt.org/z/fT7a7jP4e
> ispowerof2(unsigned long long):
>     xor     eax, eax
>     xor     edx, edx
>     popcnt  eax, [esp+4]
>     popcnt  edx, [esp+8]
>     add     eax, edx        # eax is less than 64!

Less than or equal to 64 (consider the case when input is (unsigned long long)-1).

>     cmp     eax, 1          ->    dec eax       # 2 bytes shorter
>     sete    al
>     movzx   eax, al         # superfluous

Not when dec is used. If you use dec and omit this instruction, you may get a result value of 0xffffff00 (consider the case when input is (unsigned long long)0).

>     ret
>
> 5 bytes and 1 instruction saved; 5 bytes here and there accumulate to kilo- or even megabytes, and they can extend code to cross a cache line or a 16-byte alignment boundary.
>
> JFTR: same for "__builtin_popcount(argument) == 1;" and 32-bit argument
>
> JFTR: GCC is notorious for generating superfluous MOVZX instructions where its optimiser SHOULD be able to see that the value is already less than 256!
>
> Stefan
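The bound on the summed popcount is easy to confirm in C. The sketch below mirrors the 32-bit code path by counting the two halves separately; the function names are invented for this example, and `__builtin_popcount` is the GCC/Clang builtin for 32-bit operands.

```c
#include <assert.h>
#include <stdint.h>

/* Sum of the popcounts of the low and high 32-bit halves, as the
 * 32-bit code in the thread computes it with two popcnt instructions. */
unsigned popcount64_split(uint64_t value)
{
    return (unsigned)__builtin_popcount((uint32_t)value)
         + (unsigned)__builtin_popcount((uint32_t)(value >> 32));
}

/* Popcount-based power-of-two test built on the split count. */
int ispowerof2_popcount(uint64_t value)
{
    return popcount64_split(value) == 1;
}

/* Hypothetical self-check of the boundary cases discussed above. */
void check_popcount_bounds(void)
{
    assert(popcount64_split(UINT64_MAX) == 64);  /* (unsigned long long)-1 */
    assert(popcount64_split(0) == 0);
    assert(ispowerof2_popcount(0) == 0);
    assert(ispowerof2_popcount(UINT64_C(1) << 63) == 1);
}
```

The count therefore ranges over 0 to 64 inclusive, which is the range any shortened instruction sequence has to handle correctly.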
Re: Another epic optimiser failure
On 5/29/23 15:01, Dave Blanchard wrote:
> He's certainly got a few things wrong from time to time in his zeal, but his overall point seems to stand. Do you have any rebuttals of his argument to present yourself? Or do you prefer to just sit back and wait on "y'all" to do the heavy lifting?

He's gotten many details wrong, including the proper flags to set for gcc (and the "bad documentation" does not justify all the errors he's made) and his hand-generated assembly (I've personally pointed out logic errors in his assembly on more than one occasion), and he has failed to provide evidence that his solutions are better.

In almost all of his examples, he uses -O3, which is basically the "speed above all else" optimization level. I pointed this out before; I also pointed out that the smallest code (in bytes) with the fewest instructions is not always the fastest. He has not provided any data showing that his solutions result in faster executing code than what gcc produces. He has also raised questions that show a distinct lack of understanding when it comes to storage hierarchy, something I feel one would need to know to properly write fast assembly. Finally, I will admit some of the examples of gcc-produced code are a bit suspicious, and probably should be reviewed.

In short, Stefan is not being taken seriously because he is not presenting himself, or his arguments, in a manner that would convince people to take him seriously. As long as Stefan continues to communicate in such a manner, we're going to see similar responses from (some of) the gcc devs (unfortunately). The best next steps for Stefan would be to review the constructive criticism, expand on his examples by providing explanation and proof as to why they're better, and then present these updated findings in the proper manner.
Using his first example as my own, take the C code:

    int ispowerof2(unsigned long long argument)
    {
        return (argument & argument - 1) == 0;
    }

which, when compiled, produces:

    % gcc -m32 -O3 -c ispowerof2.c && objdump -d -Mintel ispowerof2.o

    ispowerof2.o:     file format elf32-i386

    Disassembly of section .text:

    :
       0:   f3 0f 7e 4c 24 04       movq   xmm1,QWORD PTR [esp+0x4]
       6:   66 0f 76 c0             pcmpeqd xmm0,xmm0
       a:   66 0f d4 c1             paddq  xmm0,xmm1
       e:   66 0f db c1             pand   xmm0,xmm1
      12:   66 0f 7e c2             movd   edx,xmm0
      16:   66 0f 73 d0 20          psrlq  xmm0,0x20
      1b:   66 0f 7e c0             movd   eax,xmm0
      1f:   09 c2                   or     edx,eax
      21:   0f 94 c0                sete   al
      24:   0f b6 c0                movzx  eax,al
      27:   c3                      ret

Whereas he claims the following is better:

    movq     xmm1, [esp+4]
    pcmpeqd  xmm0, xmm0
    paddq    xmm0, xmm1
    pand     xmm0, xmm1
    pxor     xmm1, xmm1
    pcmpeqb  xmm0, xmm1
    pmovmskb eax, xmm0
    cmp      al, 255
    sete     al
    ret

because it has 10 instructions and is 36 bytes long vs. the 11 instructions and 40 bytes. However, the rebuttals are:

1. his code is wrong (it can return values other than 0 or 1), and
2. -O3 doesn't optimize on instruction count or byte size (as an aside: clang's output uses 14 instructions but is only 32 bytes in size -- is it better or worse than gcc's?).

Therefore, while his version is 1 instruction and 4 bytes shorter (1 byte shorter if you add the needed correction), he presents no evidence that his solution is actually faster. What he would need to do instead is show proof that his solution is indeed faster than what gcc produces. Afterwards, he would be in a position to re-present this data in a proper manner.
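As a rough illustration of the kind of evidence being asked for, the sketch below times repeated calls to a candidate implementation using the standard clock() function. It is a minimal harness under stated assumptions, not a rigorous benchmark: the names `predicate_fn` and `seconds_per_call` are invented for this example, the measured time includes the indirect call and the loop itself, and a hand-written assembly version would have to be assembled separately and linked in as another function with the same signature.

```c
#include <assert.h>
#include <time.h>

/* Signature shared by every candidate implementation. */
typedef int (*predicate_fn)(unsigned long long);

/* The C version from the thread, with explicit parentheses added. */
int ispowerof2(unsigned long long argument)
{
    return (argument & (argument - 1)) == 0;
}

/* Average CPU time per call in seconds, measured with clock().
 * The volatile accumulator keeps the compiler from discarding the calls. */
double seconds_per_call(predicate_fn candidate, unsigned long iterations)
{
    volatile int sink = 0;

    if (iterations == 0)
        return 0.0;

    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; i++)
        sink += candidate(i);
    clock_t stop = clock();

    (void)sink;
    return (double)(stop - start) / CLOCKS_PER_SEC / (double)iterations;
}

/* Hypothetical self-check: the harness runs and reports a sane value. */
void check_harness(void)
{
    assert(seconds_per_call(ispowerof2, 100000UL) >= 0.0);
}
```

Comparing `seconds_per_call` for two candidates over the same iteration count, repeated several times on an otherwise idle machine, would give a first indication of whether the shorter sequence is actually faster.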
Re: How "()" works and why
On 2/9/25 10:38, саша савельев wrote:
> To whom it may concern
>
> My classmates and I found strange behaviour of «()» (in C++) during an IT lesson, but our teacher couldn't tell us why it works this way. Afterwards we tried to find out by ourselves, but we found nothing. We understand HOW it works, but not WHY. Could you explain it to us?
>
> std::cout << (std::cout.fixed, std::cout.precision(100), acos(-1));
>
> This code will call all functions in order (we think any combination of functions will be called in order, so its behaviour might be defined), but «()» will return only the last value (the return value of acos(-1)).

What you are describing is not the behavior of '(expression)', but the behavior of the comma operator (https://en.wikipedia.org/wiki/Comma_operator).

> Alexander and his friends
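The comma operator evaluates its left operand, discards the result, then evaluates its right operand and yields that value. The built-in operator behaves the same way in C and C++, so the sketch below uses C; the helper names are invented for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Increments the shared counter, prints the call order and returns it. */
static int record_call(int *counter, const char *label)
{
    *counter += 1;
    printf("%d: %s\n", *counter, label);
    return *counter;
}

/* The parentheses only group; the commas inside them sequence the two
 * calls left to right and make 42 the value of the whole expression. */
int comma_demo(int *first_order, int *second_order)
{
    int counter = 0;
    return (*first_order = record_call(&counter, "first operand"),
            *second_order = record_call(&counter, "second operand"),
            42);
}

/* Hypothetical self-check of both the ordering and the resulting value. */
void check_comma_demo(void)
{
    int first = 0, second = 0;
    assert(comma_demo(&first, &second) == 42);
    assert(first == 1 && second == 2);
}
```

This matches the observation in the question: every operand is evaluated in order, and only the last one supplies the value that is printed.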