[Bug c/99339] New: Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 Bug ID: 99339 Summary: Poor codegen with simple varargs Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

These two functions should generate the same code, but the varargs version is much worse. And it is even worse when you enable -fstack-protector-strong. https://godbolt.org/z/noEYoh

#include <stdarg.h>

int test_va(int x, ...) {
    int i;
    va_list va;
    va_start(va, x);
    i = va_arg(va, int);
    va_end(va);
    return i + x;
}

int test_args(int x, int i) {
    return i + x;
}

# explicit args with or without stack protection
test_args:
        lea     eax, [rsi+rdi]
        ret

# without stack protection (why aren't dead stores to the stack being eliminated?)
test_va:
        lea     rax, [rsp+8]
        mov     QWORD PTR [rsp-40], rsi
        mov     QWORD PTR [rsp-64], rax
        lea     rax, [rsp-48]
        mov     QWORD PTR [rsp-56], rax
        mov     eax, DWORD PTR [rsp-40]
        mov     DWORD PTR [rsp-72], 8
        add     eax, edi
        ret

# with stack protection (yikes!)
test_va:
        sub     rsp, 88
        mov     QWORD PTR [rsp+40], rsi
        mov     rax, QWORD PTR fs:40
        mov     QWORD PTR [rsp+24], rax
        xor     eax, eax
        lea     rax, [rsp+96]
        mov     DWORD PTR [rsp], 8
        mov     QWORD PTR [rsp+8], rax
        lea     rax, [rsp+32]
        mov     QWORD PTR [rsp+16], rax
        mov     eax, DWORD PTR [rsp+40]
        add     eax, edi
        mov     rdx, QWORD PTR [rsp+24]
        sub     rdx, QWORD PTR fs:40
        jne     .L7
        add     rsp, 88
        ret
.L7:
        call    __stack_chk_fail
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 --- Comment #5 from Mathias Stearn --- I filed a related bug with clang right after this one if anyone wants to follow along: https://bugs.llvm.org/show_bug.cgi?id=49395. Just because clang does worse doesn't mean gcc shouldn't do better ;)
[Bug libstdc++/95079] unordered_map::insert_or_assign and try_emplace should only hash and mod once unless there is a rehash.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95079 --- Comment #5 from Mathias Stearn --- @François Dumont: Sorry I didn't see your question earlier. unordered_map performance suffers on 64-bit platforms because the container is designed to do a size_t modulus-by-prime on every lookup, and on most platforms a 64-bit division is *very* expensive (up to 100 cycles, vs. 20ish for 32 bits). Some very modern CPUs have improved here, but it is still much more expensive than using power-of-2 bucket counts and masking, even if you have to hash the hash when you don't trust the low-order bits to have enough entropy. Unfortunately, fixing this is a pretty big ABI break, so it isn't going to change any time soon.
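To make the tradeoff concrete, a minimal sketch (not libstdc++ code; the mixing constant and the xor-fold are illustrative only) of the two bucket-index strategies being compared:

#include <cstddef>
#include <cstdint>

// Prime bucket count: a full 64-bit integer division/modulo on every lookup.
std::size_t bucket_prime(std::size_t hash, std::size_t prime_bucket_count) {
    return hash % prime_bucket_count;
}

// Power-of-2 bucket count: "hash the hash" so the mask sees good entropy,
// then mask -- a multiply, a shift/xor, and an AND instead of a divide.
std::size_t bucket_pow2(std::size_t hash, std::size_t pow2_bucket_count) {
    std::uint64_t h = std::uint64_t(hash) * 0x9E3779B97F4A7C15ull;
    h ^= h >> 32;
    return std::size_t(h) & (pow2_bucket_count - 1);
}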
[Bug middle-end/99339] Poor codegen with simple varargs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339 --- Comment #6 from Mathias Stearn ---
> The question is how common in the wild it is and if it is worth the work.

I'm not sure how common it is, but this is my use case. The code in the bug report is a slightly simplified example from some Real World Code I am working on: https://source.wiredtiger.com/develop/struct_w_t___c_u_r_s_o_r.html#af19f6f9d9c7fc248ab38879032620b2f. Essentially WT_CURSOR is a structure of function pointers that all take a WT_CURSOR pointer as the first argument, similar to C++ virtual functions. The get_key() and get_value() "methods" both take (WT_CURSOR*, ...) arguments, and unpack the arguments based on the format string that was set up when you opened the cursor. The expectation is that they will be called many times for each cursor object.

Since we know the format string when creating the cursor, I was experimenting with creating specialized functions for common formats rather than dispatching through the generic format-inspecting logic every time. I noticed that I was able to get even more performance by declaring the functions as taking the arguments they actually take rather than dealing with va_arg, then casting the function pointers to the generic (WT_CURSOR*, ...) type when assigning into the method slot. I assume this is UB in C (I don't know the C standard nearly as well as C++) but all ABIs I know of are designed to make this kind of thing work. I'd rather not have to do those kinds of questionable shenanigans in order to get the same performance.
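A minimal sketch (hypothetical names, not the actual WiredTiger declarations) of the pattern being described: a fixed-signature specialization is cast into the generic variadic method slot, relying on ABI behavior rather than the C standard:

#include <cstdint>

struct cursor {
    int (*get_value)(cursor*, ...);   // generic variadic "method" slot
    std::int64_t cached_value;
};

// Specialized implementation for a cursor whose value format is a single int64.
int get_value_int64(cursor* c, std::int64_t* out) {
    *out = c->cached_value;
    return 0;
}

void install_specialized(cursor* c) {
    // The questionable-but-ABI-friendly cast discussed above.
    c->get_value = reinterpret_cast<int (*)(cursor*, ...)>(get_value_int64);
}

Callers then invoke c->get_value(c, &out) through the variadic type, which works on the ABIs mentioned even though the callee was declared with a fixed signature.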
[Bug c++/99470] New: Convert fixed index addition to array address offset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470 Bug ID: 99470 Summary: Convert fixed index addition to array address offset Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

These two functions do the same thing, but f() is the cleaner source code (especially when arr is a std::array) while g() generates better code: https://gcc.godbolt.org/z/vTT399

#include <cstdint>

inline int8_t arr[256];

bool f(int a, int b) { return arr[a+128] == arr[b+128]; }
bool g(int a, int b) { return (arr+128)[a] == (arr+128)[b]; }

f(int, int):
        sub     edi, -128
        sub     esi, -128
        lea     rax, arr[rip]
        movsx   rdi, edi
        movsx   rsi, esi
        movzx   edx, BYTE PTR [rax+rsi]
        cmp     BYTE PTR [rax+rdi], dl
        sete    al
        ret
g(int, int):
        lea     rax, arr[rip+128]
        movsx   rdi, edi
        movsx   rsi, esi
        movzx   edx, BYTE PTR [rax+rsi]
        cmp     BYTE PTR [rax+rdi], dl
        sete    al
        ret

In addition to only doing the +128 once, it also ends up being completely free in g() because the assembler (or linker?) folds the addition into the address calculation by adjusting the offset of the rip-relative address. In the godbolt link, you can see that when compiled to binary, the LEA instruction uses the same form in both f() and g(), so the addition really is free in g().
[Bug rtl-optimization/99470] Convert fixed index addition to array address offset
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470 Mathias Stearn changed: What|Removed |Added Status|RESOLVED|UNCONFIRMED Resolution|INVALID |--- --- Comment #4 from Mathias Stearn ---

Yes, but I believe any case where they would access different addresses would be UB overflow in f(), making it valid to turn f() into g(), especially if you used an internal lowering which sign-extended the index to pointer width and had defined wrapping semantics. I'll note that clang already generates identical code for f() and g(): https://gcc.godbolt.org/z/j897sh, although I think gcc has better codegen at least for g().

Also, my example was perhaps oversimplified. My indexes were actually int8_t (which is why I'm indexing into a 256-element array), so due to int promotion, overflow is actually impossible. However, with int8_t arguments, gcc generates even worse code for f(), doing the sign-extension twice for some reason (8 -> 32 -> 64): https://gcc.godbolt.org/z/5r9h89

(I hope it isn't a faux pas to reopen the ticket, but I think I've provided enough new information that this warrants another look)
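For reference, a sketch of the int8_t-index variant being described (the assumed shape of the godbolt example, not reproduced verbatim): integer promotion means a+128 is always in [0, 255], so overflow cannot occur and f() and g() must be interchangeable.

#include <cstdint>

inline int8_t arr8[256];

// a and b promote to int before the addition, so a+128 stays in [0, 255].
bool f8(int8_t a, int8_t b) { return arr8[a + 128] == arr8[b + 128]; }
bool g8(int8_t a, int8_t b) { return (arr8 + 128)[a] == (arr8 + 128)[b]; }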
[Bug libstdc++/99979] New: condition_variable_any has wrong behavior if Lock::lock() throws
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99979 Bug ID: 99979 Summary: condition_variable_any has wrong behavior if Lock::lock() throws Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://eel.is/c++draft/thread.condvarany.wait says "Postconditions: lock is locked by the calling thread. […] Remarks: If the function fails to meet the postcondition, terminate() is invoked […] This can happen if the re-locking of the mutex throws an exception." This wording was changed in C++14 for https://cplusplus.github.io/LWG/issue2135, but given that it is borderline impossible to use a condition_variable[_any] correctly if it doesn't guarantee relocking, it seems best to do the same in C++11 mode as well.

Unfortunately, std::condition_variable_any::__Unlock::~__Unlock() either ignores[1] or propagates any exception thrown by lock(), which means that the postcondition of wait() may not be fulfilled. I think the correct behavior would be to remove the noexcept(false), allowing the destructor's default noexcept to apply, and just unconditionally call _M_lock.lock() without any try/catch.

[1] It ignores the exception if std::uncaught_exception() returns true. I'm not 100% sure why that dispatch is there. It seems like it may be trying to detect if _M_cond.wait() threw, but a) that is noexcept and b) that doesn't work correctly if some thread waits on a condvar inside a destructor that is triggered during exception unwinding (this is why std::uncaught_exceptions() was introduced).
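A minimal sketch (not the libstdc++ source; the name is illustrative) of the unlock/relock guard behavior being proposed: with the destructor left implicitly noexcept, a throw from Lock::lock() during relock reaches std::terminate() instead of letting wait() return without its postcondition.

template <typename Lock>
struct relock_guard {
    explicit relock_guard(Lock& l) : lock(l) { lock.unlock(); }
    ~relock_guard() { lock.lock(); }   // implicitly noexcept: a throw here terminates
    Lock& lock;
};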
[Bug c++/98639] GCC accepts cast from Base to Derived in C++20 mode
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98639 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #4 from Mathias Stearn --- This really seems like it *is* supposed to be allowed based on simply following the normal language rules for aggregates. It would probably require a special case rule to prevent it from working. Is there something that leads you to think that such a special case rule was forgotten, rather than the natural outcome being expected?
[Bug libstdc++/99021] New: coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 Bug ID: 99021 Summary: coroutine_handle<_Promise>::from_address() should be noexcept Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: --- While it isn't required by the standard, it seems odd that coroutine_handle<>::from_address() is noexcept while the non-specialized version isn't, when they are the same code. This would also help with clang/libstdc++ compatibility, because clang currently checks for noexcept on from_address() as part of validating https://eel.is/c++draft/dcl.fct.def.coroutine#15 (I'll be filing a bug with them about that shortly since that is leaking an internal compiler implementation detail).
[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 --- Comment #1 from Mathias Stearn --- clang bug for reference https://bugs.llvm.org/show_bug.cgi?id=49109
[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021 --- Comment #3 from Mathias Stearn --- Thanks for the quick fix! Any chance of a backport of this and https://github.com/gcc-mirror/gcc/commit/f1b6e46c417224887c2f21baa6d4c538a25fe9fb#diff-ed08df78eba81189642b4e8d670a0adb4b377db6846aecb090b4dce52a9251fa to the v10 branch? It will improve the experience of anyone using clang-based editor tooling when using libstdc++ rather than libc++.
[Bug c++/99047] New: ICE on simple task coroutine example
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99047 Bug ID: 99047 Summary: ICE on simple task coroutine example Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/ecxEbb

Code compiles on MSVC and clang, and at least on clang produces the expected output.

#include <coroutine>
#include <optional>

template <typename T>
struct [[nodiscard]] task {
    struct promise_type {
        std::suspend_always initial_suspend() { return {}; }
        auto final_suspend() noexcept {
            struct awaiter {
                std::false_type await_ready() noexcept { return {}; }
                std::coroutine_handle<> await_suspend(std::coroutine_handle<>) noexcept { return next; }
                void await_resume() noexcept {}
                std::coroutine_handle<> next;
            };
            return awaiter{next};
        }
        void unhandled_exception() noexcept { std::terminate(); }
        auto get_return_object() { return task(this); }
        auto coro() { return std::coroutine_handle<promise_type>::from_promise(*this); }
        void return_value(T val) { result.emplace(std::move(val)); }

        std::coroutine_handle<> next;
        std::optional<T> result;
    };

    task(task&& source) : p(std::exchange(source.p, nullptr)) {}
    explicit task(promise_type* p) : p(p) {}
    ~task() { if (p) p->coro().destroy(); }

    bool await_ready() noexcept { return p->coro().done(); }
    std::coroutine_handle<> await_suspend(std::coroutine_handle<> next) noexcept {
        p->next = next;
        return p->coro();
    }
    const T& await_resume() const& noexcept { return *p->result; }

    promise_type* p;
};

task<int> five() { co_return 5; }
task<int> six() { co_return (co_await five()) + 1; }

int main() {
    auto task = six();
    task.p->next = std::noop_coroutine();
    task.p->coro().resume();
    return *task.p->result;
}

<source>: In function 'void _Z4fivev.actor(five()::_Z4fivev.frame*)':
<source>:92:11: internal compiler error: in fold_convert_loc, at fold-const.c:2430
   92 | task<int> five() { co_return 5; }
      |           ^~~~
0x1ce6f09 internal_error(char const*, ...)
        ???:0
0x6b6f43 fancy_abort(char const*, int, char const*)
        ???:0
0xc92cd4 fold_convert_loc(unsigned int, tree_node*, tree_node*)
        ???:0
0xd126ae gimple_boolify(tree_node*)
        ???:0
0xd1b720 gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int)
        ???:0
0xd1bb4a gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*), int)
[snipping many more frames]
[Bug c++/100876] New: -Wmismatched-new-delete should either look through or ignore placement new
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100876 Bug ID: 100876 Summary: -Wmismatched-new-delete should either look through or ignore placement new Product: gcc Version: 11.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/KTTMrEGns

Example code:

    free(new (malloc(4)) int());   // Warns but shouldn't
    delete new (malloc(4)) int();  // Doesn't warn but should

output:

<source>:5:9: warning: 'void free(void*)' called on pointer returned from a mismatched allocation function [-Wmismatched-new-delete]
    5 | free(new (malloc(4)) int()); // Warns but shouldn't
      |     ^~~
<source>:5:30: note: returned from 'void* operator new(std::size_t, void*)'
    5 | free(new (malloc(4)) int()); // Warns but shouldn't
      |                              ^

While it would be nice to have a warning on the second line, not warning on the first seems more important, and hopefully it is a backportable fix. Here is some Real World Code exhibiting this pattern that g++ currently warns about when compiling: https://github.com/facebook/hermes/blob/dfef1abd6d20b196e24c591e225a7003e6337a94/unittests/VMRuntime/StringPrimitiveTest.cpp#L221-L235. There is also an example using calloc() lower in that file.
[Bug c++/89997] Garbled expression in error message with -fconcepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997 --- Comment #3 from Mathias Stearn --- Please reopen. It still seems to be broken with -std=c++20 as the only flag: https://godbolt.org/z/bWMq4s6xb (trunk) https://godbolt.org/z/W3xWjWaGe (12.2)

Output:

<source>: In function 'void test()':
<source>:16:15: error: no matching function for call to 'check<int>()'
   16 |     check<int>(); // mangled error
      |     ~~~~~~~~~~^~
<source>:12:6: note: candidate: 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
      |      ^~~~~
<source>:12:6: note:   template argument deduction/substitution failed:
<source>:12:6: note: constraints not satisfied
<source>: In substitution of 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;} [with T = int]':
<source>:16:15:   required from here
<source>:12:6:   required by the constraints of 'template<class T> void check() requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
<source>:12:23:   in requirements with 'X x', 'T val' [with T = int]
<source>:12:60: note: the required expression '("hello"->x.X::operator<<() << val)' is invalid
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
      |                                                           ~^~
cc1plus: note: set '-fconcepts-diagnostics-depth=' to at least 2 for more detail
Compiler returned: 1

The last line still says "the required expression '("hello"->x.X::operator<<() << val)' is invalid". It should not be trying to apply -> to a string literal. The following line with the caret and underline is very helpful and shows what the problem is. But the "note" line seems actively harmful, since it is showing an expression that would never be valid for any type. It seems like it would be better to remove that line than to attempt to show it if you can't reproduce the correct expression.
[Bug c++/89997] Garbled expression in error message with -fconcepts
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997 --- Comment #6 from Mathias Stearn ---
> I think this is probably not concepts-specific, and just another variant of
> PR 49152.

Perhaps the busted pretty printer is a general problem, but at least in this case I think the fix may be in concepts code. It looks like the error is generated at https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/cp/constraint.cc#L3300 (or the similar call 7 lines lower). Given that gcc already prints the source loc with the invalid expression, I think you can just remove the %qE to improve the diagnostic output. (I don't know the gcc codebase at all, so I could be wrong about this; I just grepped for the string "the required expression".)
[Bug libstdc++/104223] New: GCC unable to inline trivial functions passed to views::filter and transform unless lifted into types
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104223 Bug ID: 104223 Summary: GCC unable to inline trivial functions passed to views::filter and transform unless lifted into types Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

I'm not sure if this is an issue with the optimizer or the way that the library code is written, or some combination of the two, but the end result seems unfortunate. with() and without() are logically the same function; the only difference is that with() lifts the function pointers into types using a templated lambda variable, while without() just passes the function names directly to the library. It seems interesting that the optimizer knows that they are "constant enough" to emit direct rather than indirect calls to t() and f(), yet they aren't constant enough to inline those calls.

https://godbolt.org/z/EqWzzh916
Flags: -std=c++2b -O3
Reproduces on at least 11.2 and trunk

#include <ranges>

namespace views = std::views;

void trace() noexcept;
inline int f(int i) { trace(); return i; }
inline bool t(int) { return true; }

// for some reason gcc needs this in order to inline f() and t()
template <auto f> auto typify = [] (int i) { return f(i); };

void with() {
    for (auto&& x : views::single(1) | views::transform(typify<f>) | views::filter(typify<t>)) {}
}

void without() {
    for (auto&& x : views::single(1) | views::transform(f) | views::filter(t)) {}
}

with():
        sub     rsp, 8
        call    trace()
        add     rsp, 8
        jmp     trace()
without():
        sub     rsp, 8
        mov     edi, 1
        call    f(int)
        mov     edi, eax
        call    t(int)
        test    al, al
        jne     .L10
        add     rsp, 8
        ret
.L10:
        mov     edi, 1
        add     rsp, 8
        jmp     f(int)
[Bug c++/102363] New: source_location in await_transform has function_name referring to internal coroutine funclet rather than source-level function
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102363 Bug ID: 102363 Summary: source_location in await_transform has function_name referring to internal coroutine funclet rather than source-level function Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

When a promise_type::await_transform() has a defaulted argument of source_location::current(), it refers to the location of the co_await that triggered the await_transform. This is really useful for building runtime diagnostics and tracing. The file/line/column fields all seem to be filled out correctly; however, the function_name field refers to an internal funclet that the coroutine function is split into. I think it would be more useful if it had the original coroutine function name, since that is what users will be expecting.

Example: https://godbolt.org/z/sbMsPG8Eq

Output:

good: /app/example.cpp:20:43: example func()
bad: /app/example.cpp:25:14: void func(func()::_Z4funcv.frame*)

Source (compiled with just -std=c++20):

#include <coroutine>
#include <source_location>
#include <cstdio>

struct example {
    struct promise_type {
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        example get_return_object() { return {}; }
        void return_void() {}
        auto await_transform(const char* dummy,
                             std::source_location l = std::source_location::current()) {
            printf("bad: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(), l.function_name());
            return std::suspend_never();
        }
    };
};

example func() {
    auto l = std::source_location::current();
    printf("good: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(), l.function_name());
    // bad location points here:
    //          v
    co_await "pretend this is an awaitable";
}

int main() {
    func();
}
[Bug web/102365] New: Function attribute docs should have an anchor or id on each attribute.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102365 Bug ID: 102365 Summary: Function attribute docs should have an anchor or id on each attribute. Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: web Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html

This would make it easier to link to the docs for a specific attribute. There is currently an anchor on the description element (e.g. https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bartificial_007d-function-attribute-2500 for artificial), but following it scrolls down so that the name of the attribute isn't shown. Ideally the anchor should be on the element that contains the attribute name instead. A nice enhancement would be to make the attribute name a self-link, to make it easy to get the link without needing to poke in the browser's inspect facility.
[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #2 from Mathias Stearn --- This is making it harder for us to transition to -std=c++20, since we can't have the same code be warning-clean in both versions. I really don't think the warning for [=, this] is helpful given that it is the pattern that is now recommended.
[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493 --- Comment #3 from Mathias Stearn --- When reading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82611, I noticed that C++17 actually requires the warning on [=, this] from a conforming implementation: https://timsong-cpp.github.io/cppwp/n4659/expr.prim.lambda.capture#2. However, given that this requirement was lifted in C++20, would it be possible to only warn about that in -std=c++17 (or lower) with -pedantic? It seems counterproductive to warn about it without -pedantic.
[Bug c++/102528] New: Unable to inline even trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 Bug ID: 102528 Summary: Unable to inline even trivial coroutines Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/aoab9W4xG

This should all compile away, and test() should just be a single ret instruction. That is not what happens now, even with -O3.

#include <coroutine>

struct simple {
    struct promise_type {
        void return_void() {}
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        simple get_return_object() { return {}; }
    };

    std::true_type await_ready() { return {}; }
    void await_suspend(std::coroutine_handle<>) {}
    void await_resume();
};

inline simple test1() { co_return; }
inline simple test2() { co_await test1(); co_return; }

void test() { test2(); }
[Bug c++/102528] Unused out-of-line functions emitted for trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 --- Comment #1 from Mathias Stearn --- Sorry, there was a typo in the initial code. I forgot the trivial implementation of await_resume(). (D'oh!) Now I can see that test() is just a ret instruction, but there is still a lot of dead code emitted for the coroutine functions. While it isn't too much for the trivial functions, for a real use case it would be. And it seems like a bug anyway that unused code makes it to the obj file at -O3: https://godbolt.org/z/bTncofrzd

#include <coroutine>

struct simple {
    struct promise_type {
        void return_void() {}
        std::suspend_never initial_suspend() { return {}; }
        std::suspend_never final_suspend() noexcept { return {}; }
        void unhandled_exception() { throw; }
        simple get_return_object() { return {}; }
    };

    std::true_type await_ready() { return {}; }
    void await_suspend(std::coroutine_handle<>) {}
    void await_resume() {}
};

inline simple test1() { co_return; }
inline simple test2() { co_await test1(); co_return; }

void test() { test2(); }

test1(test1()::_Z5test1v.frame*) [clone .actor]:
        movzx   eax, WORD PTR [rdi+18]
        test    al, 1
        je      .L2
        cmp     ax, 5
        ja      .L3
        mov     edx, 42
        bt      rdx, rax
        jnc     .L3
.L4:
        cmp     BYTE PTR [rdi+17], 0
        jne     .L14
.L1:
        ret
.L2:
        cmp     ax, 2
        je      .L5
        cmp     ax, 4
        je      .L4
        test    ax, ax
        jne     .L3
        mov     QWORD PTR [rdi+24], rdi
.L5:
        cmp     BYTE PTR [rdi+17], 0
        mov     BYTE PTR [rdi+32], 1
        mov     QWORD PTR [rdi], 0
        je      .L1
.L14:
        jmp     operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .actor] [clone .cold]:
.L3:
        ud2
test2(test2()::_Z5test2v.frame*) [clone .actor]:
        movzx   eax, WORD PTR [rdi+18]
        test    al, 1
        je      .L16
        cmp     ax, 7
        ja      .L17
        mov     edx, 170
        bt      rdx, rax
        jnc     .L17
.L18:
        cmp     BYTE PTR [rdi+17], 0
        jne     .L33
.L15:
        ret
.L16:
        cmp     ax, 4
        je      .L19
        ja      .L20
        test    ax, ax
        jne     .L34
        mov     QWORD PTR [rdi+24], rdi
.L22:
        mov     BYTE PTR [rdi+32], 1
.L19:
        cmp     BYTE PTR [rdi+17], 0
        mov     QWORD PTR [rdi], 0
        je      .L15
.L33:
        jmp     operator delete(void*)
.L34:
        cmp     ax, 2
        je      .L22
        jmp     .L17
.L20:
        cmp     ax, 6
        je      .L18
        jmp     .L17
test2(test2()::_Z5test2v.frame*) [clone .actor] [clone .cold]:
.L17:
        ud2
test1(test1()::_Z5test1v.frame*) [clone .destroy]:
        movzx   eax, WORD PTR [rdi+18]
        or      eax, 1
        mov     WORD PTR [rdi+18], ax
        cmp     ax, 5
        ja      .L36
        mov     edx, 42
        bt      rdx, rax
        jnc     .L36
        cmp     BYTE PTR [rdi+17], 0
        jne     .L39
        ret
.L39:
        jmp     operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .destroy] [clone .cold]:
.L36:
        ud2
test2(test2()::_Z5test2v.frame*) [clone .destroy]:
        movzx   eax, WORD PTR [rdi+18]
        or      eax, 1
        mov     WORD PTR [rdi+18], ax
        cmp     ax, 7
        ja      .L41
        mov     edx, 170
        bt      rdx, rax
        jnc     .L41
        cmp     BYTE PTR [rdi+17], 0
        jne     .L44
        ret
.L44:
        jmp     operator delete(void*)
test2(test2()::_Z5test2v.frame*) [clone .destroy] [clone .cold]:
.L41:
        ud2
test():
        ret
[Bug ipa/102528] Unused out-of-line functions emitted for trivial coroutines
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528 --- Comment #8 from Mathias Stearn --- Sorry again about the confusion caused by my typo. I am not able to edit the comment to make it clear that the comment#0 should be ignored. If that happens again, would it be better for me to close the ticket and create a new one?
[Bug c++/102876] New: GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 Bug ID: 102876 Summary: GCC fails to use constant initialization even when it knows the value to initialize Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

See: https://godbolt.org/z/KnKv9j8b9

#include <chrono>

using namespace std::literals;

/*not constexpr*/ inline std::chrono::sys_days compute_days() {
    return 1776y/7/4;
}

std::chrono::sys_days BURN_IN_TO_BINARY = compute_days();

Clang and MSVC both correctly burn BURN_IN_TO_BINARY into the binary image with the correct value. Even with -O3, gcc zero-initializes it and then uses a dynamic initializer to complete the initialization. Both are valid implementation strategies according to https://eel.is/c++draft/basic.start.static#3; however, I think the strategy used by clang and MSVC is clearly superior QoI here.
[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 --- Comment #3 from Mathias Stearn ---
> Why not just make the function constexpr though?

That isn't always possible. Sometimes the initializer may call a third-party function that is inline, but not yet marked constexpr (it may need to support older language versions where it couldn't be constexpr). Other times the initializer may call a function that is out of line (so can't be constexpr at all), but defined in the same TU. MSVC and clang handle this somewhat more realistic example nicely, gcc doesn't: https://godbolt.org/z/jYKx8359T

The original example using chrono was just something that when reading I thought "any optimizer worth its salt should be able to do this even without explicit constexpr annotation". I was disappointed to learn that gcc couldn't, so I filed a bug in the hope that it can be improved.
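A minimal sketch (hypothetical names; not the godbolt example above) of the first situation described: the initializer calls an inline third-party function that isn't marked constexpr, so the variable can't be declared constexpr or constinit, yet the value is statically known and an optimizer could still emit it as constant data.

// Imagine this lives in a third-party header that must also build as C++03,
// so it cannot be marked constexpr.
inline int third_party_default_flags() { return 0x7f; }

// Cannot be constexpr/constinit because the callee isn't constexpr, but the
// value is a compile-time constant; gcc still emits a dynamic initializer.
int FLAGS = third_party_default_flags();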
[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876 --- Comment #12 from Mathias Stearn --- (In reply to Jakub Jelinek from comment #10)
> So we'd just punt at optimizing that, we don't know if b is read or written
> by foo (and, note, it doesn't have to be just the case of explicitly being
> passed address of some var, it can get the address through other means).
> On the other side, we can't optimize b to b: .long 2, because bar can use
> the variable and/or modify it, so by using 2 as static initializer bar would
> be miscompiled.

I'm pretty sure that that is explicitly allowed by https://eel.is/c++draft/basic.start.static#3, so it should *not* be considered a miscompilation. The example only shows reading from another static-duration variable's initializer, but I believe writing would also be covered.

I took a look at how other compilers handle this, and it is somewhat interesting: https://godbolt.org/z/9YvcbEeax

int foo(int&);
inline int bar() { return 7; }
extern int b;

int z = bar();
int a = foo(b); // comment this out and watch clang change...
int b = bar();
int c = bar();

GCC: always does dynamic init for everything.

MSVC: always does static init for z, b, and c; always dynamic init for a.

Clang: it seems like if it does any dynamic init in the TU, it doesn't promote any dynamic init to static. So with that code all four variables are dynamically initialized, but if you comment out a, the remaining 3 become static. If you add an unrelated variable that requires dynamic init, those 3 become dynamically initialized again.

I don't understand why clang does what it does. I don't think it is required to do that by the standard, and it clearly seems suboptimal. So I would rather GCC behave like MSVC in this case than like clang.

Also note what happens if we provide a definition for foo like `inline int foo(int& x) { return x += 6; }`: https://godbolt.org/z/sWd6chsnP. Now both MSVC and Clang will static initialize z, b, and c to 7 and *static* initialize a to 6. GCC gets the same result dynamically, but for some reason tries to load b prior to adding 6, even though it has to be 0 (barring a UB write to b from another TU).
[Bug target/105496] New: Comparison optimizations result in unnecessary cmp instructions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105496 Bug ID: 105496 Summary: Comparison optimizations result in unnecessary cmp instructions Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/1zdYsaqEj

Consider these equivalent functions:

int test1(int x) {
    if (x <= 10) return 123;
    if (x == 11) return 456;
    return 789;
}

int test2(int x) {
    if (x < 11) return 123;
    if (x == 11) return 456;
    return 789;
}

In test2 it is very clear that you can do a single cmp of x with 11 then use different flag bits to choose your case. In test1 it is less clear, but because x<=10 and x<11 are equivalent, you could always transform one to the other. Clang seems to do this correctly and transforms test1 into test2 and only emits a single cmp instruction in each. For some reason, not only does gcc miss this optimization, it seems to go the other direction and transform test2 into test1, emitting 2 cmp instructions for both!

test1(int):
        mov     eax, 123
        cmp     edi, 10
        jle     .L1
        cmp     edi, 11
        mov     eax, 456
        mov     edx, 789
        cmovne  eax, edx
.L1:
        ret
test2(int):
        mov     eax, 123
        cmp     edi, 10
        jle     .L6
        cmp     edi, 11
        mov     eax, 456
        mov     edx, 789
        cmovne  eax, edx
.L6:
        ret

Observed with at least -O2 and -O3. I initially observed this for code where each if generated an actual branch rather than a cmov, but when I reduced the example, the cmov was generated.

I'm not sure if this should be a middle-end or target specific optimization, since ideally it would be smart on all targets that use comparison flags, even if there are some targets that don't. Is there ever a down side to trying to make two adjacent comparisons compare the same number?
[Bug c++/105534] New: -Wmaybe-uninitialized shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534 Bug ID: 105534 Summary: -Wmaybe-uninitialized shouldn't suppress -Wuninitialized warnings Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

The following function emits a -Wuninitialized warning on ++count with -Wall (https://godbolt.org/z/KfaMEETY1):

int test(bool cond) {
    int count;
    ++count;
    return count;
}

Making the increment conditional changes it to a -Wmaybe-uninitialized warning, which is suppressed with -Wno-maybe-uninitialized (https://godbolt.org/z/qarMrqW7E):

int test(bool cond) {
    int count;
    if (cond) ++count;
    return count;
}

This makes no sense. count is never initialized on any path through the function, and it is returned on all paths.

We use -Wall with -Wno-maybe-uninitialized on our codebase because we were getting too many false positives with -Wmaybe-uninitialized, in particular from third-party headers that we didn't want to modify. At the time we decided to do that, we didn't realize that we would also be missing out on clearly uninitialized cases like this.
[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534 --- Comment #3 from Mathias Stearn --- One slightly confusing aspect is that the wording of the flag implies that the variable may or may not be uninitialized (because in -Wmaybe-uninitialized "maybe" binds directly to "uninitialized"), while the phrasing of the warning message is about the usage being conditional: "may be used uninitialized". And of course the documentation (at least the man page) uses a different phrasing:

> For an object with automatic or allocated storage duration, if there exists a
> path from the function entry to a use of the object that is initialized, but
> there exist some other paths for which the object is not initialized, the
> compiler emits a warning if it cannot prove the uninitialized paths are not
> executed at run time.

For both the initial example with count, and your example with count2, I'd say that the "there exists a path from the function entry to a use of the object that is initialized" bit is clearly not satisfied, so if we assume the documentation is correct, then those cases both lack a "maybe" and the variables are clearly uninitialized. This would also match my intuition for -Wuninitialized, which is that it warns definitively when all paths from declaration to any usage leave the variable uninitialized.

PS - This test case is a reduced example from an actual bug that luckily was found by coverity before release: https://jira.mongodb.org/browse/SERVER-66306. I dug in to make a repro because I was expecting that we would have gotten a compiler error on that code before it was even committed. I'm also exploring whether we can stop passing -Wno-maybe-uninitialized, but it looks like we still get false positives in third-party headers, so it doesn't seem likely.
[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534 --- Comment #4 from Mathias Stearn --- (In reply to Richard Biener from comment #2)
> Note there's
>
>   _2 = value_1(D) * x_2;
>
> where _2 might be effectively uninitialized iff x_2 is not zero. When x_2
> is zero then _2 has a well-defined value.

Not according to the C++ standard. http://eel.is/c++draft/basic.indet seems pretty clear that unless x_2 is std::byte (which doesn't support multiplication) or an "unsigned ordinary character type", then that is UB. FWIW I still think I'd expect the warning on "unsigned char x, y; y = x * 0;", but I would definitely expect the warning for int.
[Bug c++/105560] New: Spurious -Wunused-local-typedefs warning on a typedef used on returned type
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105560 Bug ID: 105560 Summary: Spurious -Wunused-local-typedefs warning on a typedef used on returned type Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/zcY11ezev

#include <type_traits>
#include <utility>

template <typename F>
typename F::Out call(F&& fun) {
    typename F::Out var = fun(1);
    return var;
}

void test() {
    // Fill builder with bson-encoded elements from view.
    auto decorateLambda = [](auto&& lambda) {
        using Lambda = std::remove_reference_t<decltype(lambda)>;
        struct DecoratedLambda : Lambda {
            using Out = bool;
        };
        return DecoratedLambda{std::forward<decltype(lambda)>(lambda)};
    };

    call(decorateLambda([&](auto&& value) {
        (void)(value);
        return true;
    }));
}

With -O2 -std=c++20 -Wall:

<source>: In lambda function:
<source>:15:19: warning: typedef 'using test()DecoratedLambda::Out = bool' locally defined but not used [-Wunused-local-typedefs]
   15 |             using Out = bool;
      |                   ^~~
Compiler returned: 0

However, DecoratedLambda::Out *is* used by call(), both in the signature and in the body. Perhaps the issue is that typedefs in local types that are returned (or are reachable from returned types) shouldn't be considered "local"?
[Bug target/105661] New: Comparisons to atomic variables generates less efficient code
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105661 Bug ID: 105661 Summary: Comparisons to atomic variables generates less efficient code Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

With normal variables, gcc will generate a nice cmp/js pair when checking whether the high bit is set. When using an atomic, gcc generates a movzx/test/js triple, even though it could use the same codegen as for a non-atomic. https://godbolt.org/z/GorvWfrsh

#include <atomic>
#include <cstdint>

[[gnu::noinline]] void f();

uint8_t plain;
std::atomic<uint8_t> atomic;

void plain_test() {
    if (plain & 0x80) f();
}

void atomic_test() {
    if (atomic.load(std::memory_order_relaxed) & 0x80) f();
}

With both -O2 and -O3 this generates:

plain_test():
        cmp     BYTE PTR plain[rip], 0
        js      .L4
        ret
.L4:
        jmp     f()
atomic_test():
        movzx   eax, BYTE PTR atomic[rip]
        test    al, al
        js      .L7
        ret
.L7:
        jmp     f()

ARM64 seems to be hit even harder, but I don't know that platform well enough to know if the non-atomic codegen is valid there: https://godbolt.org/z/c3h8Y1dan. It seems likely though, at least for a relaxed load.
[Bug c++/105397] Cannot export module initializer symbols with `-fvisibility=hidden`
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105397 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #1 from Mathias Stearn ---

Perhaps the best option is to default the visibility of the implicit functions to the widest visibility of any function or object in module purview exposed by the TU. The assumption being that if anything is visible outside the library, then it is expected to be imported from TUs outside the library and that should Just Work. Conversely, if everything is defined with internal visibility, then it is unlikely that this module was intended to be imported from outside of the library, and so it may be desirable to allow different libs to have their own module with the same name.

Unfortunately that doesn't give any good indication of what to do for importable units that have an empty module purview (or where everything inside it has TU-local internal linkage). While legal, maybe that isn't a case worth optimizing the Just Works experience for?
[Bug tree-optimization/59429] Missed optimization opportunity in qsort-style comparison functions
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59429 --- Comment #16 from Mathias Stearn --- Trunk still generates different code for all cases (in some cases subtly so) for both aarch64 and x86_64: https://www.godbolt.org/z/1s6sxrMWq.

For both arches, it seems like LE and LG generate the best code. On aarch64, they probably all have the same throughput, but EL and EG have a size penalty with an extra instruction.

On x86_64, there is much more variety. EL and EG both end up with a branch rather than being branchless, which is probably bad since comparison functions are often called in ways where the result branches are unpredictable. GE and GL appear to have regressed since this ticket was created. They now do the comparison twice rather than reusing the flags from the first comparison:

comGL(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setl    al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
comGE(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setne   al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
[Bug libstdc++/111588] New: Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588 Bug ID: 111588 Summary: Provide opt-out of shared_ptr single-threaded optimization Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

Right now there is a fast path for single-threaded programs to avoid the overhead of atomics in shared_ptr, but there is no equivalent for a program that knows it is multi-threaded to remove the check and branch. If __GTHREADS is not defined then no atomic code is emitted. There are two issues with this:

1) Programs that know they are effectively always multithreaded pay for a runtime branch and .text segment bloat for an optimization that never applies. This may have the knock-on effect of making functions that use shared_ptr less likely to be inlined by pushing them slightly over the complexity threshold.

2) It invalidates single-threaded microbenchmarks of code that uses shared_ptr, because the performance of the code may be very different from when it runs in the real multithreaded program.

I understand the value of making a fast mode for single-threaded code, and I can even accept having the runtime branch by default, rather than as an opt-in, when it is unknown whether the program will be run with multiple threads. But an opt-out would be nice to have. If it had to be a gcc-build-time option rather than a #define, that would be acceptable for us since we always use our own build of gcc, but it seems like a worse option for other users.

FWIW, neither llvm libc++ (https://github.com/llvm/llvm-project/blob/0bfaed8c612705cfa8c5382d26d8089a0a26386b/libcxx/include/__memory/shared_ptr.h#L103-L110) nor MS-STL (https://github.com/microsoft/STL/blob/main/stl/inc/memory#L1171-L1173) ever uses runtime branching to detect multithreading.
[Bug libstdc++/111589] New: Use relaxed atomic increment (but not decrement!) in shared_ptr
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111589 Bug ID: 111589 Summary: Use relaxed atomic increment (but not decrement!) in shared_ptr Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

The atomic increment when copying a shared_ptr can be relaxed because it is never actually used as a synchronization operation. The current thread must already have sufficient synchronization to access the memory because it can already deref the pointer. All synchronization is done either via whatever program-provided code makes the shared_ptr object available to the thread, or in the atomic decrement (where the decrements to non-zero are releases that ensure all uses of the object happen before the final decrement to zero acquires and destroys the object).

As an argument-from-authority, libc++ already uses relaxed for increments and acq/rel for decrements: https://github.com/llvm/llvm-project/blob/c649fd34e928ad01951cbff298c5c44853dd41dd/libcxx/include/__memory/shared_ptr.h#L101-L121

This will have no impact on x86, where all atomic RMWs are effectively sequentially consistent, but it will enable the use of ldadd rather than ldaddal on aarch64, and similar optimizations on other weaker architectures.
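A minimal sketch (not the libstdc++ _Sp_counted_base code; names are illustrative) of the ordering argument above: copies need only a relaxed increment, while the decrement uses acq/rel so that every use of the object happens-before the final decrement that destroys it.

#include <atomic>

struct ref_count {
    std::atomic<long> uses{1};

    // Copying a shared_ptr: no synchronization needed beyond atomicity.
    void add_ref() {
        uses.fetch_add(1, std::memory_order_relaxed);
    }

    // Destroying a shared_ptr: release publishes this thread's uses of the
    // object; the acquire on the final decrement sees them before destruction.
    // Returns true when the caller should run the deleter.
    bool release() {
        return uses.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};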
[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #4 from Mathias Stearn ---

clang has already been using the optimized memcmp code since v16, even at -O1: https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9) were still branch-free, but via a less optimal sequence of instructions.

GCC's code gets even more ridiculous at 32 bytes, because it does a branch after every 8-byte compare, while the clang code is fully branch-free (not that branch-free is always better, but it seems clearly so in this case). Judging by the codegen, there seem to be three deficiencies in GCC: 1) an inability to take advantage of the load-pair instructions to load 16 bytes at a time, 2) an inability to use ccmp to combine comparisons, and 3) using branching rather than cset to fill the output register. Ideally these could all be done in the general case by the low-level instruction optimizer, but even getting them special-cased for memcmp (and friends) would be an improvement.
[Bug c++/114357] Add a way to not call deconstructors for non-auto decls, like clang's no_destroy attribute
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114357 Mathias Stearn changed: What|Removed |Added CC||redbeard0531 at gmail dot com --- Comment #9 from Mathias Stearn ---

For codebases that always use `quick_exit()` or `_Exit()` to exit, all static destructors are dead code bloat. For constinit function-statics with non-trivial destructors (that will never run) this results in unnecessary overhead each time the function is called: https://godbolt.org/z/jsjTYdMf3

> It changes behavior and even could cause memory leaks if abused.

For static duration objects there is no such thing as a memory leak. The whole address space is going away. Allowing this on thread-local duration objects is probably a mistake though (with a possible exception for embedded and kernel-level code that does its own thread teardown).
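A minimal sketch (hypothetical type; not the godbolt example above) of the function-static case mentioned: even with constinit, the non-trivial destructor must be registered for exit-time teardown on first use, so every call pays a guard check that is pure waste in a program that only ever exits via quick_exit()/_Exit().

struct Registry {
    constexpr Registry() = default;
    ~Registry() { /* flush, unregister, ... */ }  // user-provided => non-trivial
};

Registry& instance() {
    static constinit Registry r;  // constant-initialized, but the destructor
                                  // registration still forces a guarded init
    return r;
}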
[Bug middle-end/116383] New: Value from __atomic_store not forwarded to non-atomic load at same address
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116383 Bug ID: 116383 Summary: Value from __atomic_store not forwarded to non-atomic load at same address Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://godbolt.org/z/1bbjoc87n

int test(int* i, int val) {
    __atomic_store_n(i, val, __ATOMIC_RELAXED);
    return *i;
}

The non-atomic load should be able to directly use the value stored by the atomic store, but instead GCC issues a new load:

        mov     DWORD PTR [rdi], esi
        mov     eax, DWORD PTR [rdi]
        ret

Clang recognizes that the load is unnecessary and propagates the value:

        mov     eax, esi
        mov     dword ptr [rdi], esi
        ret

In addition to simply being an unnecessary load, most CPUs impose an additional penalty for reading a value that is still in the CPU's store buffer, which it almost certainly would be in this case. And of course this also disables further optimizations, e.g. DSE and value propagation where the compiler knows something about the value.

void blocking_further_optimizations(int* i) {
    if (test(i, 1) == 0) {
        __builtin_abort();
    }
}

generates the following with gcc

        mov     DWORD PTR [rdi], 1
        mov     edx, DWORD PTR [rdi]
        test    edx, edx
        je      .L5
        ret
blocking_further_optimizations(int*) [clone .cold]:
.L5:
        push    rax
        call    abort

And this much better output with clang

        mov     dword ptr [rdi], 1
        ret

While I'm using a relaxed store here to show that gcc doesn't apply the optimization even in that case, I think the optimization should apply regardless of memory ordering (and clang seems to agree).

Also, while the minimal example code is contrived, there are several real-world use cases where this pattern can come up. I would expect it in cases where there is a single writer thread but many reader threads. The writes and off-thread reads need to use __atomic ops to avoid data races, but on-thread reads should be safe using ordinary loads, and you would want them to be optimized as much as possible.
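A minimal sketch (hypothetical names) of that single-writer/many-reader pattern, where the writer's own plain reads are exactly the loads that should be able to reuse the just-stored value:

struct Counter {
    int value = 0;

    // Called only from the owning writer thread.
    int bump() {
        __atomic_store_n(&value, value + 1, __ATOMIC_RELAXED);
        return value;  // plain on-thread read; ideally folded into the stored value
    }

    // Callable from any other thread.
    int read() {
        return __atomic_load_n(&value, __ATOMIC_RELAXED);
    }
};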
[Bug other/111976] New: Large constant zero-valued objects should go in .bss rather than .rodata
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976 Bug ID: 111976 Summary: Large constant zero-valued objects should go in .bss rather than .rodata Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

Right now constant objects always go in .rodata. This is nice because it will give a nice loud error if you ever write to it. However, .rodata takes up space in the binary file and in memory at runtime. If you instead put it in .bss it takes up no space in the binary file and, at least on linux, it gets backed by a single zero-filled page of physical memory. Ideally there would be something like .robss which gave you the best of both worlds, but this is admittedly niche for all the effort to add a new section like that. I think the best option is to leave it in .rodata for non-optimized builds to catch bugs, but when optimizing, especially with -Os, put it in .bss.

Repro https://www.godbolt.org/z/3rWvTrsTv:

constexpr int GB = 1024*1024*1024;

// Goes in .bss - no space in binary or runtime.
char nonconst_zeros[GB] = {};

// Goes in .rodata - expensive in binary size and runtime memory.
extern const char const_zeros[GB]; // force usage
const char const_zeros[GB] = {};

        .globl  const_zeros
        .section        .rodata
        .align 32
        .type   const_zeros, @object
        .size   const_zeros, 1073741824
const_zeros:
        .zero   1073741824
        .globl  nonconst_zeros
        .bss
        .align 32
        .type   nonconst_zeros, @object
        .size   nonconst_zeros, 1073741824
nonconst_zeros:
        .zero   1073741824
[Bug other/111976] Large constant zero-valued objects should go in .bss rather than .rodata
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976 --- Comment #1 from Mathias Stearn --- Just to be clear, I think we should only do this for "large" objects or collections of objects. If you did it for small objects in general, there is a risk of spreading the mutable data that is written to over more pages, so that you end up with more runtime anonymous pages rather than read-only pages backed by the file cache, making things more expensive overall. I think a good rule to prevent this would be to only do it for objects larger than a page, and possibly add page alignment to them. It may also be possible to collect enough smaller objects together to make it worth doing this. Not sure how often that occurs in practice though.

Also, at -Os it may make sense to do this for objects of any size, since putting small objects in .bss will still shrink the binary size. Not sure how users of -Os would react to tradeoffs involving runtime memory consumption vs binary size.
[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588 --- Comment #4 from Mathias Stearn --- So even though I already knew in the back of my mind about how this can affect benchmark results, I *still* got burned by it! I was benchmarking a simple hazard pointer implementation against shared ptrs by having them crab-walk a list of 1000 pointers. This was an admittedly simple and unrealistic benchmark that only ever accessed the objects from a single thread and never had any contention. It was a few hours after getting the initial results that I remembered this issue and went back to add a bg thread.

> This needs numbers, not opinions

While my biggest concern was misleading benchmarks (which I now feel a bit validated by my own mistake :), here are the numbers I see. I added boost::shared_ptr on the assumption (unvalidated) that the primary difference would be that it unconditionally uses atomics.

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_scanLinks_HazPtr              6420 ns         6420 ns       108948
BM_scanLinks_BoostSharedPtr     11223 ns        11223 ns        62380
BM_scanLinks_StdSharedPtr        3217 ns         3217 ns       217621
BM_scanLinks_AtomicSharedPtr    28920 ns        28920 ns        24200

And with a bg thread doing nothing but sleeping:

---------------------------------------------------------------------
Benchmark                           Time             CPU   Iterations
---------------------------------------------------------------------
BM_scanLinks_HazPtr              6887 ns         6887 ns       101445
BM_scanLinks_BoostSharedPtr     11224 ns        11224 ns        62373
BM_scanLinks_StdSharedPtr       14221 ns        14221 ns        49221
BM_scanLinks_AtomicSharedPtr    49574 ns        49573 ns        14126
[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588 --- Comment #5 from Mathias Stearn --- Mea culpa. The difference between boost and std was due to the code to fast-path shared_ptrs that aren't actually shared: https://github.com/gcc-mirror/gcc/blob/be34a8b538c0f04b11a428bd1a9340eb19dec13f/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h#L324-L362. I still think that optimization is a good idea, even if it looks bad in this specific microbenchmark. When that is disabled, they have the same perf, even with the check for single-threaded. That said, I'd still love an opt out. For now, I'll just propose that we add a do-nothing bg thread in our benchmark main() to avoid misleading results.
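A minimal sketch of that workaround (assuming a main()-style benchmark driver): keep one do-nothing thread alive so the libstdc++ single-threaded fast path can never kick in and the shared_ptr numbers reflect multi-threaded behavior.

#include <chrono>
#include <thread>

int main() {
    // Its mere existence makes the process multi-threaded for the whole run.
    std::thread bg([] { std::this_thread::sleep_for(std::chrono::hours(24)); });
    bg.detach();

    // ... register and run the benchmarks as before ...
}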
[Bug middle-end/113585] New: Poor codegen turning int compare into -1,0,1
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113585 Bug ID: 113585 Summary: Poor codegen turning int compare into -1,0,1 Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: middle-end Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

https://www.godbolt.org/z/Y31xG7EeT

These two functions should be equivalent, but comp2() produces better code than comp1() on both arm64 and x86_64:

int comp1(int a, int b) {
    return a == b ? 0 : (a < b ? -1 : 1);
}

int comp2(int a, int b) {
    return a < b ? -1 : (a > b ? 1 : 0);
}

arm64:

comp1(int, int):
        cmp     w0, w1
        mov     w0, -1
        csinc   w0, w0, wzr, lt
        csel    w0, w0, wzr, ne
        ret
comp2(int, int):
        cmp     w0, w1
        cset    w0, gt
        csinv   w0, w0, wzr, ge
        ret

x86_64:

comp1(int, int):
        xor     eax, eax
        cmp     edi, esi
        je      .L1
        setge   al
        movzx   eax, al
        lea     eax, [rax-1+rax]
.L1:
        ret
comp2(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, -1
        setg    al
        cmovl   eax, edx
        ret

In the arm case, I suspect that the perf will be equivalent, although comp1 has an extra instruction, so will pay a size penalty. On x86, comp2 is branchless while comp1 has a branch, which could be problematic if not predictable. It seems like there should be a normalization pass to convert these functions (and other equivalent ones) into a single normalized representation so that they get the same codegen.

PS: I tried another equivalent function and it produces even worse codegen on x86_64, comparing the same registers twice:

int comp3(int a, int b) {
    return a > b ? 1 : (a == b ? 0 : -1);
}

comp3(int, int):
        xor     eax, eax
        cmp     edi, esi
        mov     edx, 1
        setne   al
        neg     eax
        cmp     edi, esi
        cmovg   eax, edx
        ret
[Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682 Bug ID: 113682 Summary: Branches in branchless binary search rather than cmov/csel/csinc Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: other Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

I've been trying to eliminate unpredictable branches in a hot function where perf counters show a high mispredict percentage. Unfortunately, I haven't been able to find an incantation to get gcc to generate branchless code other than inline asm, which I'd rather avoid. In this case I've even laid out the critical lines so that they exactly match the behavior of the csinc and csel instructions on arm64, but they are still not used.

Somewhat minimized repro:

typedef unsigned long size_t;
struct ITEM {char* addr; size_t len;};
int cmp(ITEM* user, ITEM* tree);

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
    auto pos = tree;
    size_t low = 0;
    size_t high = treeSize;
    while (low < high) {
        size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        // These should be cmp + csinc + csel on arm
        // and lea + test + cmov + cmov on x86.
        low = res > 0 ? i + 1 : low;  // csinc
        high = res < 0 ? i : high;    // csel
        if (res == 0)
            return i;
    }
    return -1;
}

On arm64 that generates a conditional branch on res > 0:

        bl      cmp(ITEM*, ITEM*)
        cmp     w0, 0
        bgt     .L15    // does low = i + 1 then loops
        mov     x20, x19
        bne     .L4     // loop

On x86_64 it does similar:

        call    cmp(ITEM*, ITEM*)
        test    eax, eax
        jg      .L16
        jne     .L17

Note that clang produces the desired codegen for both:

arm:
        bl      cmp(ITEM*, ITEM*)
        cmp     w0, #0
        csinc   x23, x23, x22, le
        csel    x19, x22, x19, lt
        cbnz    w0, .LBB0_1

x86:
        call    cmp(ITEM*, ITEM*)@PLT
        lea     rcx, [r12 + 1]
        test    eax, eax
        cmovg   r13, rcx
        cmovs   rbx, r12
        jne     .LBB0_1

(Full output for all 4 is available at https://www.godbolt.org/z/aWrKbYPTG. Code snippets are from trunk, but it also repros on 13.2.)

While ideally gcc would generate the branchless output for the supplied code, if there is some (reasonable) incantation that would cause it to produce branchless output, I'd be happy to have that too.
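[Editor's note: one source-level workaround worth mentioning (my own sketch, not something from the report, and it does not force csel/cmov either) is to do the bound updates with bitwise masks so there is no conditional select left in the source for the compiler to turn back into a branch.]

// Hedged workaround sketch, not from the report: select the new bounds with
// bitwise masks instead of ?:. Reuses the ITEM/cmp declarations from the
// repro above; whether this actually compiles to csel/cmov is not guaranteed.
#include <cstddef>

struct ITEM;
int cmp(ITEM* user, ITEM* tree);

std::size_t bsearch2_masked(ITEM* user, ITEM** tree, std::size_t treeSize) {
    std::size_t low = 0;
    std::size_t high = treeSize;
    while (low < high) {
        std::size_t i = (low + high) / 2;
        int res = cmp(user, tree[i]);
        if (res == 0)
            return i;
        // all-ones when res > 0, all-zeros otherwise (unsigned wraparound)
        std::size_t gt = 0 - static_cast<std::size_t>(res > 0);
        low  = (gt & (i + 1)) | (~gt & low);   // res > 0 ? i + 1 : low
        high = (~gt & i) | (gt & high);        // res < 0 ? i : high
    }
    return static_cast<std::size_t>(-1);
}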
[Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682 --- Comment #5 from Mathias Stearn ---

(In reply to Andrew Pinski from comment #2)
> I should note that on x86, 2 cmov in a row might be an issue and worse than
> branches. There is a cost model and the x86 backend rejects that.
>
> There are some cores where it is worse. I don't know if it applies to recent
> ones though.

Do you know if that applies to any cores that support x86_64? I checked Agner Fog's tables, and only very, very old cores (P4 era) had high reciprocal throughput, and even then it was less than the latency. It looks like all AMD cores and all Intel cores newer than Ivy Bridge (i.e. everything from the last 10 years) are able to execute multiple CMOVs per cycle (reciprocal throughput < 1). From what I can see, bad CMOV was a particular problem of the Pentium 4 and Prescott cores, and possibly the PPro, but I don't see the numbers for it. I don't think any of those cores should have an impact on the default cost model in 2024.
[Bug target/113723] New: -freorder-blocks-and-partition emits bogus asm directives on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723 Bug ID: 113723 Summary: -freorder-blocks-and-partition emits bogus asm directives on aarch64 Product: gcc Version: 14.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: target Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

It tries to subtract labels across the function split, which the assembler rejects. At the very least it does this when generating a switch jump table, but there may be other cases.

https://www.godbolt.org/z/Ynh1zzY4a

Repro with -O3 -freorder-blocks-and-partition:

[[gnu::cold]] void cold();
template <int> void f();
#define GENERATE_CASE case __COUNTER__: f<__COUNTER__>(); break

void test(int i) {
    switch(i) {
        case 1: cold(); break;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
        GENERATE_CASE;
    }
}

test(int):
        cmp     w0, 18
        bls     .L17
.L1:
        ret
.L17:
        adrp    x1, .L4
        add     x1, x1, :lo12:.L4
        ldrb    w1, [x1,w0,uxtw]
        adr     x0, .Lrtx4
        add     x1, x0, w1, sxtb #2
        br      x1
.Lrtx4:
.L4:
        .byte   (.L14 - .Lrtx4) / 4
        .byte   (.L13 - .Lrtx4) / 4   // THIS IS THE BAD ONE
        .byte   (.L12 - .Lrtx4) / 4
        ...

test(int) [clone .cold]:
.L13:
        b       cold()

/tmp/cck05x4u.s: Assembler messages:
/tmp/cck05x4u.s:48: Error: invalid operands (.text.unlikely and .text sections) for `-'
Compiler returned: 1

I've repro'd on 11, 12, and trunk.
[Bug target/113723] -freorder-blocks-and-partition emits bogus asm directives on aarch64
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723 --- Comment #1 from Mathias Stearn --- As a workaround sticking [[gnu::optimize("no-reorder-blocks-and-partition")]] on each function that triggers this seems to work. But that is obviously not a scalable solution.
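[Editor's note: to make the workaround concrete, this is roughly how it would be applied to the repro above. My own illustration, not code from the report; it only disables hot/cold splitting for the one affected function.]

// Illustration of the per-function workaround; not from the report.
[[gnu::cold]] void cold();
template <int> void f();

// Disables hot/cold splitting for just this function so the switch's jump
// table never references labels in a different section. The function body is
// a compile-only repro; f<N>() is never linked.
[[gnu::optimize("no-reorder-blocks-and-partition")]]
void test(int i) {
    switch (i) {
    case 1: cold(); break;
    case 2: f<2>(); break;
    case 3: f<3>(); break;
    // ... remaining cases as in the original repro
    }
}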
[Bug libstdc++/117655] std::string::swap() could be much faster and smaller
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117655 --- Comment #2 from Mathias Stearn --- It looks like a similar optimization would make sense for operator=(string&&): https://godbolt.org/z/Wo19fjKeK. It might also make sense to just copy the whole buffer for the local case in operator=(const string&), but the win is probably less than the rval and swap optimizations.
[Bug libstdc++/117655] New: std::string::swap() could be much faster and smaller
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117655 Bug ID: 117655 Summary: std::string::swap() could be much faster and smaller Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

TL;DR: it should swap the object (or value) representations unconditionally, then do (ideally branchless) fixups for strings that were self-referential. It could generate 20 lines of branchless x86_64 asm assembling to 43 bytes of instructions, while today it is 426 very branchy lines and 1424 bytes.

The current implementation branches into 6(!) different implementations based on whether the source and destination are local (i.e. self-referential) strings and on whether those self-referential strings are empty. Furthermore, when copying the self-referential strings, it uses a variable-sized copy which becomes a branch tree on the size. I think that none of that branching is necessary* and that it is likely to be slower on modern hardware than just unconditionally swapping all 32 bytes (or less) of the representation, then repointing the pointers to be self-referential when they point to the other's buffer.

This shows the codegen for a hastily written prototype of the optimized swap under the current implementation: https://godbolt.org/z/jrYdWEEsY

* Caveat: I'm only talking about std::string, in particular with the default allocator. The current implementation (or something similar) may still be important for non-default allocators, especially where propagate_on_container_swap is false. However, that shouldn't prevent optimizing for the common case of using the default allocator.
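[Editor's note: to illustrate the proposal, here is a sketch over a made-up 32-byte SSO layout; the field names and sizes are assumptions, not libstdc++'s actual basic_string representation, and this is not the prototype from the godbolt link.]

// Sketch of the proposed swap: copy both representations wholesale, then
// repoint any pointer that was aimed at the other object's in-object buffer.
#include <cstddef>

struct StringRep {
    char* ptr;                   // points at buf (SSO/"local") or a heap block
    std::size_t size;
    union {
        char buf[16];            // in-object buffer when local
        std::size_t capacity;    // heap capacity otherwise
    };
};

inline void swap_reps(StringRep& a, StringRep& b) {
    // Unconditionally swap the whole 32-byte representation.
    StringRep tmp = a;
    a = b;
    b = tmp;
    // Fixups: if a swapped-in pointer still targets the *other* object's
    // in-object buffer, the source string was local, so repoint it at our
    // own buffer. Compilers can typically emit these as conditional moves.
    if (a.ptr == b.buf) a.ptr = a.buf;
    if (b.ptr == a.buf) b.ptr = b.buf;
}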
[Bug libstdc++/117650] New: __glibcxx_assert_fail should be marked __attribute__((cold))
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117650 Bug ID: 117650 Summary: __glibcxx_assert_fail should be marked __attribute__((cold)) Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

The cold attribute marks incoming edges as much more strongly cold than __builtin_expect does (which IIRC currently assumes only a 90/10 bias). It even appears to be stronger than __builtin_expect_with_probability when passed 0/100, and it is treated closer to how a thrown exception is, e.g. by splitting the function into hot and cold parts so that the cold part doesn't waste icache and iTLB space that could go to another function. Since __glibcxx_assert_fail should fire even less frequently than a thrown exception (ideally never, if software didn't have all those pesky bugs...), it should be optimized against at least as strongly.

https://godbolt.org/z/svjPzrf5z shows the impact on codegen of making __GLIBCXX_NORETURN imply cold in addition to noreturn. This seemed like the easiest way to demonstrate the effect. Making noreturn imply cold might also be worth considering in its own right, since noreturn functions tend to be cold ones. We have noreturn imply cold in our codebase, FWIW.

PS - ABI concerns may make it too late to change this, but it may also be worth considering having the fail function take a single pointer to a statically allocated struct with all of the arguments, rather than generating them as separate arguments. As you can see, there are 4 instructions filling in the arguments where it could just be 1. Ideally there would be some way to encode the string addresses such that they don't need runtime relocations at startup, e.g. by encoding them as the offset from the static struct pointer to the string, then decoding them only in the fail function.
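[Editor's note: a small illustration of the effect. The handler below is a stand-in with a made-up name and signature, not the actual libstdc++ declaration; it only shows how the cold attribute pushes the failure path out of the hot code.]

// Illustrative only: marking the (never-expected) failure handler cold lets
// GCC move the call and its argument setup out of the hot path, much like a
// throw statement.
[[noreturn, gnu::cold]]
void assert_fail(const char* file, int line, const char* func, const char* cond);

int checked_at(const int* data, int size, int i) {
    if (i < 0 || i >= size)
        assert_fail(__FILE__, __LINE__, __func__, "i is in range");
    return data[i];
}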
[Bug c++/119394] New: Add fixit hint suggesting adding () to [] noexcept {} in C++ < 23
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119394 Bug ID: 119394 Summary: Add fixit hint suggesting adding () to [] noexcept {} in C++ < 23 Product: gcc Version: 15.0 Status: UNCONFIRMED Severity: normal Priority: P3 Component: c++ Assignee: unassigned at gcc dot gnu.org Reporter: redbeard0531 at gmail dot com Target Milestone: ---

Note that clang already does this: https://godbolt.org/z/73T38rWM7

void foo(auto){}
void test() {
    foo([] noexcept {});
}

No fixit when running `g++ -std=c++20 -Wall -Werror`:

<source>: In function 'void test()':
<source>:3:12: error: parameter declaration before lambda exception specification only optional with '-std=c++23' or '-std=gnu++23' [-Werror=c++23-extensions]
    3 |     foo([] noexcept {});
      |            ^~~~
cc1plus: all warnings being treated as errors

Clang 20 output with fixit:

<source>:3:12: error: lambda without a parameter clause is a C++23 extension [-Werror,-Wc++23-extensions]
    3 |     foo([] noexcept {});
      |            ^
      |            ()
1 error generated.

It might also be better to replace the technical-sounding "parameter declaration before lambda exception specification" with just "() before noexcept on a lambda", which I think will be easier for the mythical "normal C++ developer" to understand.