[Bug c/99339] New: Poor codegen with simple varargs

2021-03-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

Bug ID: 99339
   Summary: Poor codegen with simple varargs
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

These two functions should generate the same code, but the varargs version is
much worse, and worse still when you enable -fstack-protector-strong.

https://godbolt.org/z/noEYoh

#include <stdarg.h>

int test_va(int x, ...) {
int i;
va_list va;
va_start(va, x);
i = va_arg(va, int);
va_end(va);
return i + x;
}

int test_args(int x, int i) {
return i + x;
}

# explicit args with or without stack protection
test_args:
lea eax, [rsi+rdi]
ret

# without stack protection (why aren't dead stores to the stack being eliminated?)
test_va:
lea rax, [rsp+8]
mov QWORD PTR [rsp-40], rsi
mov QWORD PTR [rsp-64], rax
lea rax, [rsp-48]
mov QWORD PTR [rsp-56], rax
mov eax, DWORD PTR [rsp-40]
mov DWORD PTR [rsp-72], 8
add eax, edi
ret

# with stack protection (yikes!)
test_va:
sub rsp, 88
mov QWORD PTR [rsp+40], rsi
mov rax, QWORD PTR fs:40
mov QWORD PTR [rsp+24], rax
xor eax, eax
lea rax, [rsp+96]
mov DWORD PTR [rsp], 8
mov QWORD PTR [rsp+8], rax
lea rax, [rsp+32]
mov QWORD PTR [rsp+16], rax
mov eax, DWORD PTR [rsp+40]
add eax, edi
mov rdx, QWORD PTR [rsp+24]
sub rdx, QWORD PTR fs:40
jne .L7
add rsp, 88
ret
.L7:
call    __stack_chk_fail

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #5 from Mathias Stearn  ---
I filed a related bug with clang right after this one, if anyone wants to follow
along: https://bugs.llvm.org/show_bug.cgi?id=49395.

Just because clang does worse doesn't mean gcc shouldn't do better ;)

[Bug libstdc++/95079] unordered_map::insert_or_assign and try_emplace should only hash and mod once unless there is a rehash.

2021-03-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=95079

--- Comment #5 from Mathias Stearn  ---
@François Dumont: Sorry I didn't see your question earlier. The reason that
unordered_map perf hurts on 64-bit platforms is that it is designed to do a
size_t modulus-by-prime on every lookup, and on most platforms that is *very*
expensive (up to 100 cycles for 64 bits vs 20-ish for 32 bits). Some very modern
CPUs have made improvements here, but it is still much more expensive than just
using power-of-2 buckets and masking, even if you need to hash the hash if you
don't trust the low order bits to have enough entropy. Unfortunately, fixing
this is a pretty big ABI break, so it isn't going to change any time soon.
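A minimal sketch of the two bucket-selection strategies being contrasted; the function names and the mixing constant (borrowed from MurmurHash3's 64-bit finalizer) are illustrative, not libstdc++ internals:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bucket selection as libstdc++'s unordered_map does it: modulus by a prime
// bucket count. On x86-64 a 64-bit integer division can cost tens of cycles.
std::size_t bucket_mod_prime(std::size_t hash, std::size_t prime_buckets) {
    return hash % prime_buckets;
}

// "Hash the hash": mix so the low-order bits pick up entropy from the high
// bits, making them safe to use directly as a bucket index.
std::uint64_t mix(std::uint64_t h) {
    h ^= h >> 33;
    h *= 0xff51afd7ed558ccdULL;
    h ^= h >> 33;
    return h;
}

// Alternative: power-of-two bucket count selected by masking, which stays
// cheap even after paying for the mix step.
std::size_t bucket_mask_pow2(std::uint64_t hash, std::size_t pow2_buckets) {
    assert((pow2_buckets & (pow2_buckets - 1)) == 0);  // must be a power of 2
    return static_cast<std::size_t>(mix(hash)) & (pow2_buckets - 1);
}
```

Swapping the second strategy in is what would change the bucket-count contract observable through bucket_count() and friends, hence the ABI concern.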

[Bug middle-end/99339] Poor codegen with simple varargs

2021-03-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99339

--- Comment #6 from Mathias Stearn  ---
> The question is how common in the wild it is and if it is worth the work.

I'm not sure how common it is, but this is my use case. The code in the bug
report is a slightly simplified example from some Real World Code I am working
on:
https://source.wiredtiger.com/develop/struct_w_t___c_u_r_s_o_r.html#af19f6f9d9c7fc248ab38879032620b2f.
Essentially WT_CURSOR is a structure of function pointers that all take a
WT_CURSOR pointer as the first argument, similar to C++ virtual functions. The
get_key() and get_value() "methods" both take (WT_CURSOR*, ...) arguments, and
unpack the arguments based on the format string that was set up when you opened
the cursor. The expectation is that they will be called many times for each
cursor object. Since we know the format string when creating the cursor I was
experimenting with creating specialized functions for common formats rather
than dispatching through the generic format-inspecting logic every time. I
noticed that I was able to get even more performance by declaring the functions
as taking the arguments they actually take rather than dealing with va_args,
then casting the function pointers to the generic (WT_CURSOR, ...) type when
assigning into the method slot. I assume this is UB in C (I don't know the C
standard nearly as well as C++) but all ABIs I know of are designed to make
this kind of thing work.

I'd rather not have to do that kind of questionable shenanigans in order to get
the same performance.
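A hypothetical sketch of the pattern described above (names are invented; this is not the actual WiredTiger API): a struct of function pointers dispatched like C++ virtual functions, with a generic va_arg implementation and a specialized one. The questionable trick is casting the specialized function's address into the variadic slot, shown here only in a comment since it is likely UB:

```cpp
#include <cstdarg>

// Hypothetical cursor, loosely modeled on the WT_CURSOR pattern: a struct of
// function pointers that all take the cursor itself as the first argument.
struct Cursor {
    int (*get_value)(Cursor*, ...);  // generic variadic "method" slot
    int offset;                      // per-cursor state, purely illustrative
};

// Generic implementation: unpacks its argument with va_arg on every call.
int get_value_generic(Cursor* c, ...) {
    va_list va;
    va_start(va, c);
    int x = va_arg(va, int);
    va_end(va);
    return x + c->offset;
}

// Specialized implementation declared with the argument it actually takes.
// The comment above describes casting &get_value_int to the variadic slot's
// type -- likely UB, though common ABIs happen to make it work -- to skip
// the va_arg machinery entirely.
int get_value_int(Cursor* c, int x) {
    return x + c->offset;
}

// Dispatch through the "vtable" slot, like a C++ virtual call.
int call_generic(Cursor* c, int x) {
    return c->get_value(c, x);
}
```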

[Bug c++/99470] New: Convert fixed index addition to array address offset

2021-03-08 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470

Bug ID: 99470
   Summary: Convert fixed index addition to array address offset
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

These two functions do the same thing but f() is the cleaner source code
(especially when arr is a std::array) while g() generates better code:

https://gcc.godbolt.org/z/vTT399

#include <cstdint>

inline int8_t arr[256];

bool f(int a, int b) {
return arr[a+128] == arr[b+128];
}

bool g(int a, int b) {
return (arr+128)[a] == (arr+128)[b];
}

f(int, int):
sub edi, -128
sub esi, -128
lea rax, arr[rip]
movsx   rdi, edi
movsx   rsi, esi
movzx   edx, BYTE PTR [rax+rsi]
cmp BYTE PTR [rax+rdi], dl
sete    al
ret
g(int, int):
lea rax, arr[rip+128]
movsx   rdi, edi
movsx   rsi, esi
movzx   edx, BYTE PTR [rax+rsi]
cmp BYTE PTR [rax+rdi], dl
sete    al
ret

In addition to only doing the +128 once, it also ends up being completely free
in g() because the assembler (or linker?) folds the addition into the address
calculation by adjusting the offset of the rip-relative address. In the godbolt
link, you can see that when compiled to binary, the LEA instruction uses the
same form in both f() and g(), so the addition really is free in g().

[Bug rtl-optimization/99470] Convert fixed index addition to array address offset

2021-03-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99470

Mathias Stearn  changed:

   What|Removed |Added

 Status|RESOLVED|UNCONFIRMED
 Resolution|INVALID |---

--- Comment #4 from Mathias Stearn  ---
Yes, but I believe any case where they would access different addresses would
be UB overflow in f(), making it valid to turn f() into g(), especially if you
used an internal lowering which sign extended index to pointer width and had
defined wrapping semantics. I'll note that clang already generates identical
code for f() and g() https://gcc.godbolt.org/z/j897sh, although I think gcc has
better codegen at least for g().

Also, my example was perhaps oversimplified. My indexes were actually int8_t
(which is why I'm indexing into a 256-element array), so due to int promotion,
overflow is actually impossible. However, with int8_t arguments, gcc generates
even worse code for f(), doing the sign-extension twice for some reason (8 ->
32 -> 64): https://gcc.godbolt.org/z/5r9h89


(I hope it isn't a faux pas to reopen the ticket, but I think I've provided
enough new information that this warrants another look)
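For concreteness, a reconstruction of the int8_t variant from the description above (the actual code is only behind the godbolt link, so this is an approximation):

```cpp
#include <cstdint>

std::int8_t arr[256];

// With int8_t indexes, integer promotion turns a+128 into a plain int in
// [-128+128, 127+128] = [0, 255], so signed overflow is impossible and the
// compiler could legally fold the +128 into the rip-relative address, the
// way g() does in the int example above.
bool f(std::int8_t a, std::int8_t b) {
    return arr[a + 128] == arr[b + 128];
}
```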

[Bug libstdc++/99979] New: condition_variable_any has wrong behavior if Lock::lock() throws

2021-04-08 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99979

Bug ID: 99979
   Summary: condition_variable_any has wrong behavior if
Lock::lock() throws
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://eel.is/c++draft/thread.condvarany.wait says "Postconditions: lock is
locked by the calling thread. […] Remarks: If the function fails to meet the
postcondition, terminate() is invoked […] This can happen if the re-locking of
the mutex throws an exception." This wording was changed in C++14 for
https://cplusplus.github.io/LWG/issue2135, but given that it is borderline
impossible to use a condition_variable[_any] correctly if it doesn't guarantee
relocking, it seems best to do the same in C++11 mode as well.

Unfortunately, std::condition_variable_any::__Unlock::~__Unlock() either
ignores[1] or propagates any exception thrown by lock(), which means that the
postcondition of wait() may not be fulfilled. I think the correct behavior
would be to remove the noexcept(false), allowing the destructor's default
noexcept to apply, and just unconditionally call _M_lock.lock() without any
try/catch.

[1] It ignores if std::uncaught_exception() returns true. I'm not 100% sure why
that dispatch is there. It seems like it may be trying to detect if
_M_cond.wait() threw, but a) that is noexcept and b) that doesn't work
correctly if some thread waits on a condvar inside a destructor that is
triggered during exception unwinding (this is why std::uncaught_exceptions()
was introduced).
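A sketch of the suggested behavior, using an illustrative guard rather than the actual libstdc++ __Unlock: unlock on entry, and relock unconditionally in a destructor with the default (implicit) noexcept, so a throwing lock() reaches std::terminate() instead of leaving wait()'s postcondition unmet:

```cpp
#include <mutex>

// Illustrative relock guard, modeled on what condition_variable_any's wait()
// needs (this is NOT the actual libstdc++ __Unlock): release the lock for
// the duration of the wait, reacquire it unconditionally on the way out.
template <typename Lock>
struct UnlockGuard {
    explicit UnlockGuard(Lock& l) : lock_(l) { lock_.unlock(); }

    // The destructor is implicitly noexcept. If lock_.lock() throws -- even
    // during exception unwinding -- std::terminate() runs, which is exactly
    // the postcondition-failure behavior the C++14 wording calls for.
    ~UnlockGuard() { lock_.lock(); }

    UnlockGuard(const UnlockGuard&) = delete;
    Lock& lock_;
};
```

With a std::unique_lock, the lock is held again as soon as the guard's scope exits, matching the wait() postcondition.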

[Bug c++/98639] GCC accepts cast from Base to Derived in C++20 mode

2021-01-12 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98639

Mathias Stearn  changed:

   What|Removed |Added

 CC||redbeard0531 at gmail dot com

--- Comment #4 from Mathias Stearn  ---
This really seems like it *is* supposed to be allowed based on simply following
the normal language rules for aggregates. It would probably require a special
case rule to prevent it from working. Is there something that leads you to
think that such a special case rule was forgotten, rather than the natural
outcome being expected?

[Bug libstdc++/99021] New: coroutine_handle<_Promise>::from_address() should be noexcept

2021-02-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021

Bug ID: 99021
   Summary: coroutine_handle<_Promise>::from_address() should be
noexcept
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

While it isn't required by the standard, it seems odd that
coroutine_handle<>::from_address() is noexcept while the non-specialized
version isn't, when they are the same code. This would also help with
clang/libstdc++ compatibility because they currently check for noexcept of
from_address() as part of validating
https://eel.is/c++draft/dcl.fct.def.coroutine#15 (I'll be filing a bug with
them about that shortly since that is leaking an internal compiler
implementation detail).

[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept

2021-02-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021

--- Comment #1 from Mathias Stearn  ---
clang bug for reference https://bugs.llvm.org/show_bug.cgi?id=49109

[Bug libstdc++/99021] coroutine_handle<_Promise>::from_address() should be noexcept

2021-02-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99021

--- Comment #3 from Mathias Stearn  ---
Thanks for the quick fix!

Any chance of a backport of this and
https://github.com/gcc-mirror/gcc/commit/f1b6e46c417224887c2f21baa6d4c538a25fe9fb#diff-ed08df78eba81189642b4e8d670a0adb4b377db6846aecb090b4dce52a9251fa
to the v10 branch? It will improve the experience of anyone using clang-based
editor tooling when using libstdc++ rather than libc++.

[Bug c++/99047] New: ICE on simple task coroutine example

2021-02-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=99047

Bug ID: 99047
   Summary: ICE on simple task coroutine example
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/ecxEbb Code compiles on MSVC and clang, and at least on
clang produces the expected output.

#include <coroutine>
#include <optional>

template <typename T>
struct [[nodiscard]] task {
struct promise_type  {
std::suspend_always initial_suspend() {
return {};
}
auto final_suspend() noexcept {
struct awaiter {
std::false_type await_ready() noexcept {
return {};
}
std::coroutine_handle<> await_suspend(std::coroutine_handle<>)
noexcept {
return next;
}
void await_resume() noexcept {
}
std::coroutine_handle<> next;
};
return awaiter{next};
}

void unhandled_exception() noexcept {
std::terminate();
}
auto get_return_object() {
return task(this);
}
auto coro() {
return std::coroutine_handle<promise_type>::from_promise(*this);
}
void return_value(T val) {
result.emplace(std::move(val));
}

std::coroutine_handle<> next;
std::optional<T> result;
};

task(task&& source) : p(std::exchange(source.p, nullptr)) {}
explicit task(promise_type* p) : p(p) {}
~task() {
if (p)
p->coro().destroy();
}

bool await_ready() noexcept {
return p->coro().done();
}
std::coroutine_handle<> await_suspend(std::coroutine_handle<> next)
noexcept {
p->next = next;
return p->coro();
}
const T& await_resume() const& noexcept {
return *p->result;
}

promise_type* p;
};

task<int> five() {
co_return 5;
}
task<int> six() {
co_return (co_await five()) + 1;
}


int main() {
auto task = six();
task.p->next = std::noop_coroutine();
task.p->coro().resume();
return *task.p->result;
}

: In function 'void _Z4fivev.actor(five()::_Z4fivev.frame*)':
:92:11: internal compiler error: in fold_convert_loc, at
fold-const.c:2430
   92 | task five() {
  |   ^~~~
0x1ce6f09 internal_error(char const*, ...)
???:0
0x6b6f43 fancy_abort(char const*, int, char const*)
???:0
0xc92cd4 fold_convert_loc(unsigned int, tree_node*, tree_node*)
???:0
0xd126ae gimple_boolify(tree_node*)
???:0
0xd1b720 gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*),
int)
???:0
0xd1bb4a gimplify_expr(tree_node**, gimple**, gimple**, bool (*)(tree_node*),
int)

[snipping many more frames]

[Bug c++/100876] New: -Wmismatched-new-delete should either look through or ignore placement new

2021-06-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100876

Bug ID: 100876
   Summary: -Wmismatched-new-delete should either look through or
ignore placement new
   Product: gcc
   Version: 11.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/KTTMrEGns

Example code:
free(new (malloc(4)) int()); // Warns but shouldn't
delete new (malloc(4)) int(); // Doesn't warn but should

output:

:5:9: warning: 'void free(void*)' called on pointer returned from a
mismatched allocation function [-Wmismatched-new-delete]
5 | free(new (malloc(4)) int()); // Warns but shouldn't
  | ^~~
:5:30: note: returned from 'void* operator new(std::size_t, void*)'
5 | free(new (malloc(4)) int()); // Warns but shouldn't
  |  ^

While it would be nice to have a warning on the second line, not warning on the
first seems more important. And hopefully is a backportable fix.

Here is some Real World Code exhibiting this pattern that g++ currently warns
about when compiling:
https://github.com/facebook/hermes/blob/dfef1abd6d20b196e24c591e225a7003e6337a94/unittests/VMRuntime/StringPrimitiveTest.cpp#L221-L235.
There is also an example using calloc() lower in that file.

[Bug c++/89997] Garbled expression in error message with -fconcepts

2022-10-05 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997

--- Comment #3 from Mathias Stearn  ---
Please reopen. It still seems to be broken with -std=c++20 as the only flag:
https://godbolt.org/z/bWMq4s6xb (trunk) https://godbolt.org/z/W3xWjWaGe (12.2)

Output:

: In function 'void test()':
:16:15: error: no matching function for call to 'check()'
   16 | check(); // mangled error
  | ~~^~
:12:6: note: candidate: 'template void check() requires
requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
  |  ^
:12:6: note:   template argument deduction/substitution failed:
:12:6: note: constraints not satisfied
: In substitution of 'template void check() requires
requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;} [with T =
int]':
:16:15:   required from here
:12:6:   required by the constraints of 'template void check()
requires requires(X x, T val) {x.X::operator<<(const char*)("hello") << val;}'
:12:23:   in requirements with 'X x', 'T val' [with T = int]
:12:60: note: the required expression '("hello"->x.X::operator<<() <<
val)' is invalid
   12 | void check() requires requires (X x, T val) { x << "hello" << val; } {}
  |   ~^~
cc1plus: note: set '-fconcepts-diagnostics-depth=' to at least 2 for more
detail
Compiler returned: 1


The last "note" line still says "the required expression
'("hello"->x.X::operator<<() << val)' is invalid". It should not be trying to
apply -> to a string literal. The following line, with the caret and underline,
is very helpful and shows what the problem is. But the "note" line seems
actively harmful, since it shows an expression that would never be valid for
any type. It seems like it would be better to remove that line than attempt to
show it if you can't reproduce the correct expression.

[Bug c++/89997] Garbled expression in error message with -fconcepts

2022-10-05 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=89997

--- Comment #6 from Mathias Stearn  ---
> I think this is probably not concepts-specific, and just another variant of 
> PR 49152.

Perhaps the busted pretty printer is a general problem, but at least in this
case I think the fix may be in concepts code. It looks like the error is
generated at
https://github.com/gcc-mirror/gcc/blob/16e2427f50c208dfe07d07f18009969502c25dc8/gcc/cp/constraint.cc#L3300
(or the similar call 7 lines lower). Given that gcc already prints the source
location with the invalid expression, I think you can just remove the %qE to
improve the diagnostic output. (I don't know the gcc codebase at all, so I
could be wrong about this; I just grepped for the string "the required
expression".)

[Bug libstdc++/104223] New: GCC unable to inline trivial functions passed to views::filter and transform unless lifted into types

2022-01-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104223

Bug ID: 104223
   Summary: GCC unable to inline trivial functions passed to
views::filter and transform unless lifted into types
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I'm not sure if this is an issue with the optimizer or the way that the library
code is written, or some combination of the two, but the end result seems
unfortunate. with() and without() are logically the same function, the only
difference is that with() lifts the function pointers into types using a
templated lambda variable, while without() just passes the function names
directly to the library. Interestingly, the optimizer knows the pointers are
"constant enough" to emit direct rather than indirect calls to t() and f(),
yet apparently not constant enough to inline those calls.


https://godbolt.org/z/EqWzzh916
Flags: -std=c++2b -O3
Reproduces on at least 11.2 and trunk

#include <ranges>

namespace views = std::views;

void trace() noexcept;
inline int f(int i) {
trace();
return i;
}

inline bool t(int) { return true; }

// for some reason gcc needs this in order to inline f() and t()
template <auto f>
auto typify = [] (int i) { return f(i); };

void with() {
for (auto&& x : views::single(1) | views::transform(typify<f>) |
views::filter(typify<t>)) {}
}

void without() {
for (auto&& x : views::single(1) | views::transform(f) | views::filter(t))
{}
}

with():
sub rsp, 8
call    trace()
add rsp, 8
jmp trace()
without():
sub rsp, 8
mov edi, 1
call    f(int)
mov edi, eax
call    t(int)
test    al, al
jne .L10
add rsp, 8
ret
.L10:
mov edi, 1
add rsp, 8
jmp f(int)

[Bug c++/102363] New: source_location in await_transform has function_name referring to internal coroutine funclet rather than source-level function

2021-09-16 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102363

Bug ID: 102363
   Summary: source_location in await_transform has function_name
referring to internal coroutine funclet rather than
source-level function
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

When a promise_type::await_transform() has a defaulted argument of
source_location::current(), it refers to the location of the co_await that
triggered the await_transform. This is really useful for building runtime
diagnostics and tracing. The file/line/column fields all seem to be filled out
correctly, however, the function_name field refers to an internal funclet that
the coroutine function is split into. I think it would be more useful if it had
the original corountine function name since that is what users will be
expecting.

Example: https://godbolt.org/z/sbMsPG8Eq

Output:
  good: /app/example.cpp:20:43: example func()
  bad: /app/example.cpp:25:14: void func(func()::_Z4funcv.frame*)

Source (compiled with just -std=c++20):
#include <coroutine>
#include <source_location>
#include <cstdio>
struct example {
struct promise_type {
std::suspend_never initial_suspend() { return {}; }
std::suspend_never final_suspend() noexcept { return {}; }
void unhandled_exception() { throw; }
example get_return_object() { return {}; }
void return_void() {}

auto await_transform(const char* dummy, std::source_location l =
std::source_location::current()) {
printf("bad: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(),
l.function_name());
return std::suspend_never();
}
};
};

example func() {
auto l = std::source_location::current();
printf("good: %s:%u:%u: %s\n", l.file_name(), l.line(), l.column(),
l.function_name());

// bad location points here:
//   v
co_await "pretend this is an awaitable";
}

int main() {
func();
}

[Bug web/102365] New: Function attribute docs should have an anchor or id on each attribute.

2021-09-16 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102365

Bug ID: 102365
   Summary: Function attribute docs should have an anchor or id on
each attribute.
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: web
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html

This would make it easier to link to the docs for a specific attribute. There
is currently an anchor near each entry (eg
https://gcc.gnu.org/onlinedocs/gcc-4.7.2/gcc/Function-Attributes.html#index-g_t_0040code_007bartificial_007d-function-attribute-2500
for artificial), but it scrolls the page so that the name of the attribute
isn't shown. Ideally the anchor should be on the element containing the
attribute name instead. A nice enhancement would be to make each attribute
name a self-link, so it is easy to get the URL without needing to poke around
in the browser's inspect facility.

[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes

2021-09-22 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493

Mathias Stearn  changed:

   What|Removed |Added

 CC||redbeard0531 at gmail dot com

--- Comment #2 from Mathias Stearn  ---
This is making it harder for us to transition to -std=c++20, since we can't
have the same code be warning-clean in both versions. I really don't think the
warning for [=, this] is helpful given that it is the pattern that is now
recommended.

[Bug c++/100493] Lambda default copy capture that captures "this" cannot be used in both C++17 and C++20 modes

2021-09-22 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=100493

--- Comment #3 from Mathias Stearn  ---
When reading https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82611, I noticed that
C++17 actually requires the warning on [=, this] from a conforming
implementation:
https://timsong-cpp.github.io/cppwp/n4659/expr.prim.lambda.capture#2. However,
given that this requirement was lifted in C++20, would it be possible to only
warn about that in -std=c++17 (or lower) with -pedantic? It seems
counterproductive to warn about it without -pedantic.
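For reference, the capture spelling at issue (a minimal invented example, not code from the bug):

```cpp
struct Widget {
    int v = 42;
    int snapshot() const {
        // [=, this]: copy-capture default plus an explicit `this`. C++17
        // formally disallows the redundant `this` (hence GCC's diagnostic),
        // while C++20 deprecates relying on [=] to capture `this` implicitly,
        // making [=, this] the spelling that is clean going forward.
        auto l = [=, this] { return v; };
        return l();
    }
};
```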

[Bug c++/102528] New: Unable to inline even trivial coroutines

2021-09-29 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528

Bug ID: 102528
   Summary: Unable to inline even trivial coroutines
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/aoab9W4xG

This should all compile away, and test() should just be a single ret
instruction. That is not what happens now, even with -O3.

#include <coroutine>

struct simple {
struct promise_type {
void return_void() {}
std::suspend_never initial_suspend() { return {}; }
std::suspend_never final_suspend() noexcept { return {}; }
void unhandled_exception() { throw; }
simple get_return_object() { return {}; }
};

std::true_type await_ready() { return {}; }
void await_suspend(std::coroutine_handle<>) {}
void await_resume();
};

inline simple test1() {
co_return;
}

inline simple test2() {
co_await test1();
co_return;
}

void test() {
test2();
}

[Bug c++/102528] Unused out-of-line functions emitted for trivial coroutines

2021-09-29 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528

--- Comment #1 from Mathias Stearn  ---
Sorry, there was a typo in the initial code. I forgot the trivial
implementation of await_resume(). (D'oh!)

Now I can see that test() is just a ret instruction, but there is still a lot
of dead code emitted for the coroutine functions. While it isn't too much for
these trivial functions, for a real use case it would be. And it seems like a
bug anyway that unused code makes it into the object file at -O3:
https://godbolt.org/z/bTncofrzd

#include <coroutine>

struct simple {
struct promise_type {
void return_void() {}
std::suspend_never initial_suspend() { return {}; }
std::suspend_never final_suspend() noexcept { return {}; }
void unhandled_exception() { throw; }
simple get_return_object() { return {}; }
};

std::true_type await_ready() { return {}; }
void await_suspend(std::coroutine_handle<>) {}
void await_resume() {}
};

inline simple test1() {
co_return;
}

inline simple test2() {
co_await test1();
co_return;
}

void test() {
test2();
}

test1(test1()::_Z5test1v.frame*) [clone .actor]:
movzx   eax, WORD PTR [rdi+18]
test    al, 1
je  .L2
cmp ax, 5
ja  .L3
mov edx, 42
bt  rdx, rax
jnc .L3
.L4:
cmp BYTE PTR [rdi+17], 0
jne .L14
.L1:
ret
.L2:
cmp ax, 2
je  .L5
cmp ax, 4
je  .L4
test    ax, ax
jne .L3
mov QWORD PTR [rdi+24], rdi
.L5:
cmp BYTE PTR [rdi+17], 0
mov BYTE PTR [rdi+32], 1
mov QWORD PTR [rdi], 0
je  .L1
.L14:
jmp operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .actor] [clone .cold]:
.L3:
ud2
test2(test2()::_Z5test2v.frame*) [clone .actor]:
movzx   eax, WORD PTR [rdi+18]
test    al, 1
je  .L16
cmp ax, 7
ja  .L17
mov edx, 170
bt  rdx, rax
jnc .L17
.L18:
cmp BYTE PTR [rdi+17], 0
jne .L33
.L15:
ret
.L16:
cmp ax, 4
je  .L19
ja  .L20
test    ax, ax
jne .L34
mov QWORD PTR [rdi+24], rdi
.L22:
mov BYTE PTR [rdi+32], 1
.L19:
cmp BYTE PTR [rdi+17], 0
mov QWORD PTR [rdi], 0
je  .L15
.L33:
jmp operator delete(void*)
.L34:
cmp ax, 2
je  .L22
jmp .L17
.L20:
cmp ax, 6
je  .L18
jmp .L17
test2(test2()::_Z5test2v.frame*) [clone .actor] [clone .cold]:
.L17:
ud2
test1(test1()::_Z5test1v.frame*) [clone .destroy]:
movzx   eax, WORD PTR [rdi+18]
or  eax, 1
mov WORD PTR [rdi+18], ax
cmp ax, 5
ja  .L36
mov edx, 42
bt  rdx, rax
jnc .L36
cmp BYTE PTR [rdi+17], 0
jne .L39
ret
.L39:
jmp operator delete(void*)
test1(test1()::_Z5test1v.frame*) [clone .destroy] [clone .cold]:
.L36:
ud2
test2(test2()::_Z5test2v.frame*) [clone .destroy]:
movzx   eax, WORD PTR [rdi+18]
or  eax, 1
mov WORD PTR [rdi+18], ax
cmp ax, 7
ja  .L41
mov edx, 170
bt  rdx, rax
jnc .L41
cmp BYTE PTR [rdi+17], 0
jne .L44
ret
.L44:
jmp operator delete(void*)
test2(test2()::_Z5test2v.frame*) [clone .destroy] [clone .cold]:
.L41:
ud2
test():
ret

[Bug ipa/102528] Unused out-of-line functions emitted for trivial coroutines

2021-10-05 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102528

--- Comment #8 from Mathias Stearn  ---
Sorry again about the confusion caused by my typo. I am not able to edit the
comment to make it clear that the comment#0 should be ignored. If that happens
again, would it be better for me to close the ticket and create a new one?

[Bug c++/102876] New: GCC fails to use constant initialization even when it knows the value to initialize

2021-10-21 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876

Bug ID: 102876
   Summary: GCC fails to use constant initialization even when it
knows the value to initialize
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

See: https://godbolt.org/z/KnKv9j8b9

#include <chrono>
using namespace std::literals;
/*not constexpr*/ inline std::chrono::sys_days compute_days() {
return 1776y/7/4;
}
std::chrono::sys_days BURN_IN_TO_BINARY = compute_days();

Clang and MSVC both correctly burn BURN_IN_TO_BINARY into the binary image with
the correct value. Even with -O3, gcc zero-initializes it then uses a dynamic
initializer to complete the initialization. Both are valid implementation
strategies according to https://eel.is/c++draft/basic.start.static#3, however,
I think the strategy used by clang and MSVC is clearly superior QoI here.

[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize

2021-10-21 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876

--- Comment #3 from Mathias Stearn  ---
> Why not just make the function constexpr though?

That isn't always possible. Sometimes the initializer may call a third-party
function that is inline, but not yet marked constexpr (it may need to support
older language versions where it couldn't be constexpr). Other times the
initializer may call a function that is out of line (so can't be constexpr at
all), but defined in the same TU. MSVC and clang handle this somewhat more
realistic example nicely, gcc doesn't: https://godbolt.org/z/jYKx8359T

The original example using chrono was just something that when reading I
thought "any optimizer worth its salt should be able to do this even without
explicit constexpr annotation". I was disappointed to learn that gcc couldn't,
so I filed a bug in the hope that it can be improved.

[Bug c++/102876] GCC fails to use constant initialization even when it knows the value to initialize

2021-10-26 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=102876

--- Comment #12 from Mathias Stearn  ---
(In reply to Jakub Jelinek from comment #10)
> So we'd just punt at optimizing that, we don't know if b is read or written
> by foo (and, note, it doesn't have to be just the case of explicitly being
> passed address of some var, it can get the address through other means).
> On the other side, we can't optimize b to b: .long 2, because bar can use
> the variable and/or modify it, so by using 2 as static initializer bar would
> be miscompiled.

I'm pretty sure that that is explicitly allowed by
https://eel.is/c++draft/basic.start.static#3, so it should *not* be considered
a miscompilation. The example only shows reading from another static-duration
variable's initializer, but I believe writing would also be covered.

I took a look at how other compilers handle this, and it is somewhat
interesting: https://godbolt.org/z/9YvcbEeax

int foo(int&);
inline int bar() { return 7; }
extern int b;
int z = bar();
int a = foo(b); // comment this out and watch clang change...
int b = bar();
int c = bar();

GCC: always does dynamic init for everything.
MSVC: always does static init for z, b, and c. always dynamic init for a.
Clang: it seems like if it does any dynamic init in the TU, it doesn't promote
any dynamic init to static. So with that code all four variables are
dynamically initialized, but if you comment out a, the remaining 3 become
static. If you add an unrelated variable that requires dynamic init, those 3
become dynamically initialized again.

I don't understand why clang does what it does. I don't think it is required to
do that by the standard, and it clearly seems suboptimal. So I would rather GCC
behave like MSVC in this case than like clang.

Also note what happens if we provide a definition for foo like `inline int
foo(int& x) { return x += 6; }`: https://godbolt.org/z/sWd6chsnP. Now both MSVC
and Clang will static initialize z, b, and c to 7 and *static* initialize a to
6. GCC gets the same result dynamically, but for some reason tries to load b
prior to adding 6, even though it has to be 0 (barring a UB write to b from
another TU).

[Bug target/105496] New: Comparison optimizations result in unnecessary cmp instructions

2022-05-05 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105496

Bug ID: 105496
   Summary: Comparison optimizations result in unnecessary cmp
instructions
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/1zdYsaqEj

Consider these equivalent functions:

int test1(int x) {
if (x <= 10)
return 123;
if (x == 11)
return 456;
return 789;
}

int test2(int x) {
if (x < 11)
return 123;
if (x == 11)
return 456;
return 789;
}

In test2 it is very clear that you can do a single cmp of x with 11 then use
different flag bits to choose your case. In test1 it is less clear, but because
x<=10 and x<11 are equivalent, you could always transform one to the other.
Clang seems to do this correctly and transforms test1 into test2 and only emits
a single cmp instruction in each. For some reason, not only does gcc miss this
optimization, it seems to go the other direction and transform test2 into
test1, emitting 2 cmp instructions for both!

test1(int):
mov eax, 123
cmp edi, 10
jle .L1
cmp edi, 11
mov eax, 456
mov edx, 789
cmovne  eax, edx
.L1:
ret
test2(int):
mov eax, 123
cmp edi, 10
jle .L6
cmp edi, 11
mov eax, 456
mov edx, 789
cmovne  eax, edx
.L6:
ret

Observed with at least -O2 and -O3. I initially observed this for code where
each if generated an actual branch rather than a cmov, but when I reduced the
example, the cmov was generated.

I'm not sure if this should be a middle-end or target specific optimization,
since ideally it would be smart on all targets that use comparison flags, even
if there are some targets that don't. Is there ever a downside to trying to
make two adjacent comparisons compare the same number?

[Bug c++/105534] New: -Wmaybe-uninitialized shouldn't suppress -Wuninitialized warnings

2022-05-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

Bug ID: 105534
   Summary: -Wmaybe-uninitialized shouldn't suppress
-Wuninitialized warnings
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

The following function emits a -Wuninitialized warning on ++count with -Wall
https://godbolt.org/z/KfaMEETY1:

int test(bool cond) {
int count;
++count;
return count;
}

Making the increment be conditional changes it to a -Wmaybe-uninitialized
warning, which is suppressed with -Wno-maybe-uninitialized.
https://godbolt.org/z/qarMrqW7E

int test(bool cond) {
int count;
if (cond) ++count;
return count;
}

This makes no sense. count is never initialized on any path through the
function, and it is returned on all paths.

We use -Wall with -Wno-maybe-uninitialized on our codebase because we were
getting too many false positives with -Wmaybe-uninitialized, in particular from
third-party headers that we didn't want to modify. At the time we decided to do
that, we didn't realize that we would also be missing out on clearly
uninitialized cases like this.

[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings

2022-05-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

--- Comment #3 from Mathias Stearn  ---
One slightly confusing aspect is that the wording of the flag implies that the
variable may or may not be uninitialized (because in -Wmaybe-uninitialized,
"maybe" binds directly to "uninitialized"), while the phrasing of the warning
message is about the usage being conditional: "may be used uninitialized". And
of course the documentation (at least the man page) uses a different phrasing:

> For an object with automatic or allocated storage duration, if there exists a 
> path from the function entry to a use of the object that is initialized, but 
> there exist some other paths for which the object is not initialized, the 
> compiler emits a warning if it cannot prove the uninitialized paths are not 
> executed at run time.

For both the initial example with count, and your example with count2, I'd say
that the "there exists a path from the function entry to a use of the object
that is initialized" bit is clearly not satisfied, so if we assume the
documentation is correct, then those cases both lack a "maybe" and the
variables are clearly uninitialized.

This would also match my intuition for -Wuninitialized, which is that it
definitively errors if all paths from declaration to any usage result in the
variable being uninitialized.

PS - This test case is a reduced example from an actual bug that luckily was
found by coverity before release: https://jira.mongodb.org/browse/SERVER-66306.
I dug in to make a repro because I was expecting that we would have gotten a
compiler error on that code before it was even committed. I'm also exploring
whether we can stop passing -Wno-maybe-uninitialized, but it looks like we
still get false positives in third-party headers, so it doesn't seem likely.

[Bug c++/105534] -Wmaybe-uninitialized cases shouldn't suppress -Wuninitialized warnings

2022-05-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105534

--- Comment #4 from Mathias Stearn  ---
(In reply to Richard Biener from comment #2)
> Note there's
> 
>   _2 = value_1(D) * x_2;
> 
> where _2 might be effectively uninitialized iff x_2 is not zero.  When x_2
> is zero then _2 has a well-defined value.

Not according to the C++ standard. http://eel.is/c++draft/basic.indet seems
pretty clear that unless x_2 is std::byte (which doesn't support
multiplication) or an "unsigned ordinary character type", then that is UB. 

FWIW I still think I'd expect the warning on "unsigned char x, y; y = x * 0;",
but I would definitely expect the warning for int.

[Bug c++/105560] New: Spurious -Wunused-local-typedefs warning on a typedef used on returned type

2022-05-11 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105560

Bug ID: 105560
   Summary: Spurious -Wunused-local-typedefs warning on a typedef
used on returned type
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/zcY11ezev

#include <type_traits>
#include <utility>

template <typename F>
typename F::Out call(F&& fun) {
typename F::Out var = fun(1);
return var;
}

void test() {
// Fill builder with bson-encoded elements from view.
auto decorateLambda = [](auto&& lambda) {
using Lambda = std::remove_reference_t<decltype(lambda)>;
struct DecoratedLambda : Lambda {
using Out = bool;
};
return DecoratedLambda{std::forward<Lambda>(lambda)};
};
call(decorateLambda([&](auto&& value) {
(void)(value);
return true;
}));
}

With -O2 -std=c++20 -Wall:

: In lambda function:
:15:19: warning: typedef 'using
test()::<lambda(auto:1&&)>::DecoratedLambda::Out = bool' locally defined but
not used [-Wunused-local-typedefs]
   15 | using Out = bool;
  |   ^~~
Compiler returned: 0


However, DecoratedLambda::Out *is* used by call(), both in the signature and in
the body. Perhaps the issue is that typedefs in local types that are returned
(or are reachable from returned types) shouldn't be considered "local"?
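As a possible workaround until this is fixed, the attribute syntax for alias-declarations allows [[maybe_unused]] on the typedef itself, which should silence the warning locally without passing -Wno-unused-local-typedefs. A sketch, restructured slightly so the result is observable (whether GCC honors the attribute in exactly this position is an assumption worth verifying):

```cpp
#include <type_traits>
#include <utility>

template <typename F>
typename F::Out call(F&& fun) {
    typename F::Out var = fun(1);
    return var;
}

bool test() {
    auto decorateLambda = [](auto&& lambda) {
        using Lambda = std::remove_reference_t<decltype(lambda)>;
        struct DecoratedLambda : Lambda {
            // [[maybe_unused]] on the alias-declaration suppresses the
            // spurious warning for just this typedef.
            using Out [[maybe_unused]] = bool;
        };
        return DecoratedLambda{std::forward<Lambda>(lambda)};
    };
    return call(decorateLambda([](auto&& value) {
        (void)value;
        return true;
    }));
}
```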

[Bug target/105661] New: Comparisons to atomic variables generates less efficient code

2022-05-19 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105661

Bug ID: 105661
   Summary: Comparisons to atomic variables generates less
efficient code
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

With normal variables, gcc generates a nice cmp/js pair when checking whether
the high bit is set. When using an atomic, gcc generates a movzx/test/js
triple, even though it could use the same codegen as for a non-atomic.

https://godbolt.org/z/GorvWfrsh

#include <atomic>
#include <cstdint>

[[gnu::noinline]] void f();
uint8_t plain;
std::atomic<uint8_t> atomic;

void plain_test() {
if (plain & 0x80) f();
}

void atomic_test() {
if (atomic.load(std::memory_order_relaxed) & 0x80) f();
}

With both -O2 and -O3 this generates:

plain_test():
cmp BYTE PTR plain[rip], 0
js  .L4
ret
.L4:
jmp f()
atomic_test():
movzx   eax, BYTE PTR atomic[rip]
test    al, al
js  .L7
ret
.L7:
jmp f()

ARM64 seems to be hit even harder, but I don't know that platform well enough
to know if the non-atomic codegen is valid there
https://godbolt.org/z/c3h8Y1dan. It seems likely though, at least for a relaxed
load.

[Bug c++/105397] Cannot export module initializer symbols with `-fvisibility=hidden`

2022-06-09 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=105397

Mathias Stearn  changed:

   What|Removed |Added

 CC||redbeard0531 at gmail dot com

--- Comment #1 from Mathias Stearn  ---
Perhaps the best option is to default the visibility of the implicit functions
to the widest visibility of any function or object in module purview exposed by
the TU. The assumption being that if anything is visible outside the library,
then it is expected to be imported from TUs outside the library and that should
Just Work. Conversely, if everything is defined as internal visibility, then it
is unlikely that this module was intended to be imported from outside of the
library, and so it may be desirable to allow different libs to have their own
module with the same name.

Unfortunately that doesn't give any good indication of what to do for
importable units that have an empty module purview (or where everything inside
it has TU-local internal linkage). While legal, maybe that isn't a case worth
optimizing the Just Works experience for?

[Bug tree-optimization/59429] Missed optimization opportunity in qsort-style comparison functions

2024-03-22 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=59429

--- Comment #16 from Mathias Stearn  ---
Trunk still generates different code for all cases (in some cases subtly so)
for both aarch64 and x86_64: https://www.godbolt.org/z/1s6sxrMWq. For both
arches, it seems like LE and LG generate the best code.

On aarch64, they probably all have the same throughput, but EL and EG have a
size penalty with an extra instruction.

On x86_64, there is much more variety. EL and EG both end up with a branch
rather than being branchless, which is probably bad since comparison functions
are often called in ways that the result branches are unpredictable. GE and GL
appear to have regressed since this ticket was created. They now do the
comparison twice rather than reusing the flags from the first comparison:

comGL(int, int):
xor eax, eax
cmp edi, esi
mov edx, 1
setl    al
neg eax
cmp edi, esi
cmovg   eax, edx
ret
comGE(int, int):
xor eax, eax
cmp edi, esi
mov edx, 1
setne   al
neg eax
cmp edi, esi
cmovg   eax, edx
ret

[Bug libstdc++/111588] New: Provide opt-out of shared_ptr single-threaded optimization

2023-09-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588

Bug ID: 111588
   Summary: Provide opt-out of shared_ptr single-threaded
optimization
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

Right now there is a fast-path for single-threaded programs to avoid the
overhead of atomics in shared_ptr, but there is no equivalent for a program
that knows it is multi-threaded to remove the check and branch. If __GTHREADS
defined then no atomic code is emitted.

There are two issues with this: 1) Programs that know they are effectively
always multithreaded pay for a runtime branch and .text-segment bloat for an
optimization that never applies. This may have knock-on effects of making
functions that use shared_ptr less likely to be inlined by pushing them
slightly over the complexity threshold. 2) It invalidates single-threaded
microbenchmarks of code that uses shared_ptr because the performance of the
code may be very different from when run in the real multithreaded program.

I understand the value of making a fast mode for single-threaded code, and I
can even accept having the runtime branch by default, rather than as an opt-in,
when it is unknown if the program will be run with multiple threads. But an
opt-out would be nice to have. If it had to be a gcc-build time option rather
than a #define, that would be acceptable for us since we always use our own
build of gcc, but it seems like a worse option for other users.

FWIW, neither llvm libc++
(https://github.com/llvm/llvm-project/blob/0bfaed8c612705cfa8c5382d26d8089a0a26386b/libcxx/include/__memory/shared_ptr.h#L103-L110)
nor MS-STL
(https://github.com/microsoft/STL/blob/main/stl/inc/memory#L1171-L1173) ever
use runtime branching to detect multithreading.

[Bug libstdc++/111589] New: Use relaxed atomic increment (but not decrement!) in shared_ptr

2023-09-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111589

Bug ID: 111589
   Summary: Use relaxed atomic increment (but not decrement!) in
shared_ptr
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

The atomic increment when copying a shared_ptr can be relaxed because it is
never actually used as a synchronization operation. The current thread must
already have sufficient synchronization to access the memory because it can
already deref the pointer. All synchronization is done either via whatever
program-provided code makes the shared_ptr object available to the thread, or
in the atomic decrement (where the decrements to non-zero are releases that
ensure all uses of the object happen before the final decrement to zero
acquires and destroys the object).

As an argument-from-authority, libc++ already is using relaxed for increments
and acq/rel for decrements:
https://github.com/llvm/llvm-project/blob/c649fd34e928ad01951cbff298c5c44853dd41dd/libcxx/include/__memory/shared_ptr.h#L101-L121

This will have no impact on x86 where all atomic RMWs are effectively
sequentially consistent, but it will enable the use of ldadd rather than
ldaddal on aarch64, and similar optimizations on other weaker architectures.
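The proposed protocol can be sketched as a minimal reference count (this is an illustration of the memory-ordering argument, not libstdc++'s actual code):

```cpp
#include <atomic>

// Copies only need a relaxed increment: the copying thread already owns a
// reference, so no synchronization is transferred by the increment itself.
// The decrement uses acq_rel so every non-final decrement releases prior
// uses of the object, and the final decrement acquires them before the
// object is destroyed.
struct RefCounted {
    std::atomic<long> uses{1};

    void add_ref() {
        uses.fetch_add(1, std::memory_order_relaxed);
    }
    bool release() {  // returns true when the last reference is dropped
        return uses.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }
};
```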

[Bug target/104611] memcmp/strcmp/strncmp can be optimized when the result is tested for [in]equality with 0 on aarch64

2023-09-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=104611

Mathias Stearn  changed:

   What|Removed |Added

 CC||redbeard0531 at gmail dot com

--- Comment #4 from Mathias Stearn  ---
clang has already been using the optimized memcmp code since v16, even at -O1:
https://www.godbolt.org/z/qEd768TKr. Older versions (at least since v9) were
still branch-free, but via a less optimal sequence of instructions.

GCC's code gets even more ridiculous at 32 bytes, because it does a branch
after every 8-byte compare, while the clang code is fully branch-free (not that
branch-free is always better, but it seems clearly so in this case).

Judging by the codegen, there seem to be three deficiencies in GCC: 1) an
inability to take advantage of the load-pair instructions to load 16 bytes at a
time, 2) an inability to use ccmp to combine comparisons, and 3) use of
branching rather than cset to fill the output register. Ideally these could all
be done in the general case by the low level instruction optimizer, but even
getting them special cased for memcmp (and friends) would be an improvement.
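For reference, the branch-free sequence wanted for memcmp(a, b, 16) == 0 can be written by hand (a portable sketch using memcpy loads; whether the compiler actually lowers each pair of loads to a single ldp on aarch64 varies by version):

```cpp
#include <cstdint>
#include <cstring>

// Branch-free 16-byte equality test: load both halves of each buffer,
// XOR to find differing bits, OR to combine -- no branches, one final
// compare against zero.
bool eq16(const void* a, const void* b) {
    std::uint64_t a0, a1, b0, b1;
    std::memcpy(&a0, a, 8);
    std::memcpy(&a1, static_cast<const char*>(a) + 8, 8);
    std::memcpy(&b0, b, 8);
    std::memcpy(&b1, static_cast<const char*>(b) + 8, 8);
    return ((a0 ^ b0) | (a1 ^ b1)) == 0;
}
```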

[Bug c++/114357] Add a way to not call deconstructors for non-auto decls, like clang's no_destroy attribute

2024-08-05 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114357

Mathias Stearn  changed:

   What|Removed |Added

 CC||redbeard0531 at gmail dot com

--- Comment #9 from Mathias Stearn  ---
For codebases that always use `quick_exit()` or `_Exit()` to exit, all static
destructors are dead code bloat. For constinit function-statics with
non-trivial destructors (that will never run) this results in unnecessary
overhead each time the function is called: https://godbolt.org/z/jsjTYdMf3

> It changes behavior and even could cause memory leaks if abused.

For static duration objects there is no such thing as a memory leak. The whole
address space is going away. Allowing this on thread-local duration objects is
probably a mistake though (with a possible exception for embedded and
kernel-level code that does its own thread teardown).

[Bug middle-end/116383] New: Value from __atomic_store not forwarded to non-atomic load at same address

2024-08-15 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=116383

Bug ID: 116383
   Summary: Value from __atomic_store not forwarded to non-atomic
load at same address
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://godbolt.org/z/1bbjoc87n

int test(int* i, int val) {
__atomic_store_n(i, val, __ATOMIC_RELAXED);
return *i;
}

The non-atomic load should be able to directly use the value stored by the
atomic store, but instead GCC issues a new load:

mov DWORD PTR [rdi], esi
mov eax, DWORD PTR [rdi]
ret

Clang recognizes that the load is unnecessary and propagates the value:

mov eax, esi
mov dword ptr [rdi], esi
ret

In addition to simply being an unnecessary load, there is an additional
penalty on most CPUs from reading a value still in the CPU's store buffer,
which it almost certainly would be in this case. And of course this also
disables further optimizations, e.g. DSE and value propagation, where the
compiler knows something about the value.

void blocking_further_optimizations(int* i) {
if (test(i, 1) == 0) {
__builtin_abort();
}
}

generates the following with gcc

mov DWORD PTR [rdi], 1
mov edx, DWORD PTR [rdi]
test    edx, edx
je  .L5
ret
blocking_further_optimizations(int*) [clone .cold]:
.L5:
push    rax
call    abort

And this much better output with clang

mov dword ptr [rdi], 1
ret

While I'm using a relaxed store here to show that gcc doesn't apply the
optimization in that case, I think the optimization should apply regardless of
memory ordering (and clang seems to agree). Also while the minimal example code
is contrived, there are several real-world use cases where this pattern can
come up. I would expect it in cases where there is a single writer thread but
many reader threads. The writes and off-thread reads need to use __atomic ops
to avoid data races, but on-thread reads should be safe using ordinary loads,
and you would want them to be optimized as much as possible.

[Bug other/111976] New: Large constant zero-valued objects should go in .bss rather than .rodata

2023-10-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976

Bug ID: 111976
   Summary: Large constant zero-valued objects should go in .bss
rather than .rodata
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

Right now constant objects always go in .rodata. This is nice because it will
give a nice loud error if you ever write to it. However, .rodata takes up space
in the binary file and in memory at runtime. If you instead put it in .bss it
takes up no space in the binary file and, at least on linux, it gets backed by
a single zero-filled page of physical memory. Ideally there would be something
like .robss which gave you the best of both worlds, but this is admittedly
niche for all the effort to add a new section like that. I think the best
option is to leave it in .rodata for non-optimized builds to catch bugs, but
when optimizing, especially with -Os, put it in .bss.

Repro https://www.godbolt.org/z/3rWvTrsTv:

constexpr int GB = 1024*1024*1024;

// Goes in .bss - no space in binary or runtime.
char nonconst_zeros[GB] = {};

// Goes in .rodata - expensive in binary size and runtime memory.
extern const char const_zeros[GB]; // force usage
const char const_zeros[GB] = {};


.globl  const_zeros
.section.rodata
.align 32
.type   const_zeros, @object
.size   const_zeros, 1073741824
const_zeros:
.zero   1073741824


.globl  nonconst_zeros
.bss
.align 32
.type   nonconst_zeros, @object
.size   nonconst_zeros, 1073741824
nonconst_zeros:
.zero   1073741824

[Bug other/111976] Large constant zero-valued objects should go in .bss rather than .rodata

2023-10-25 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111976

--- Comment #1 from Mathias Stearn  ---
Just to be clear, I think we should only do this for "large" objects or
collections of objects. If you did it for small objects in general, there is a
risk that they will spread mutable data that is written to over more pages, so
that you end up with more anonymous pages at runtime rather than read-only
pages backed by the file cache, making them more expensive. I think a
good rule to prevent this would be to only do it for objects larger than a
page, and possibly add page alignment to them. It may also be possible to
collect enough smaller objects together to make it worth doing this. Not sure
how often that occurs in practice though.

Also, at -Os it may make sense to do this for objects of any size, since
putting even small objects in .bss will still shrink the binary. Not sure how
users of -Os would react to tradeoffs involving runtime memory consumption vs
binary size.

[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization

2023-10-27 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588

--- Comment #4 from Mathias Stearn  ---
So even though I already knew in the back of my mind about how this can affect
benchmark results, I *still* got burned by it! I was benchmarking a simple
hazard pointer implementation against shared ptrs by having them crab-walk a
list of 1000 pointers. This was an admittedly simple and unrealistic benchmark
that only ever accessed the objects from a single thread and never had any
contention. It was a few hours after getting the initial results that I
remembered this issue and went back to add a bg thread.

> This needs numbers, not opinions

While my biggest concern was misleading benchmarks (which I now feel a bit
validated by my own mistake :), here are the numbers I see. I added
boost::shared_ptr on the assumption (unvalidated) that the primary difference
would be that it unconditionally uses atomics.

---
Benchmark                          Time          CPU   Iterations
---
BM_scanLinks_HazPtr             6420 ns      6420 ns       108948
BM_scanLinks_BoostSharedPtr    11223 ns     11223 ns        62380
BM_scanLinks_StdSharedPtr       3217 ns      3217 ns       217621
BM_scanLinks_AtomicSharedPtr   28920 ns     28920 ns        24200

And with a bg thread doing nothing but sleeping:

---
Benchmark                          Time          CPU   Iterations
---
BM_scanLinks_HazPtr             6887 ns      6887 ns       101445
BM_scanLinks_BoostSharedPtr    11224 ns     11224 ns        62373
BM_scanLinks_StdSharedPtr      14221 ns     14221 ns        49221
BM_scanLinks_AtomicSharedPtr   49574 ns     49573 ns        14126

[Bug libstdc++/111588] Provide opt-out of shared_ptr single-threaded optimization

2023-10-27 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=111588

--- Comment #5 from Mathias Stearn  ---
Mea culpa. The difference between boost and std was due to the code to
fast-path shared_ptrs that aren't actually shared:
https://github.com/gcc-mirror/gcc/blob/be34a8b538c0f04b11a428bd1a9340eb19dec13f/libstdc%2B%2B-v3/include/bits/shared_ptr_base.h#L324-L362.
I still think that optimization is a good idea, even if it looks bad in this
specific microbenchmark. When that is disabled, they have the same perf, even
with the check for single-threaded.

That said, I'd still love an opt out. For now, I'll just propose that we add a
do-nothing bg thread in our benchmark main() to avoid misleading results.
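The proposed do-nothing background thread can be sketched like this (the helper name and the placement around the benchmark runner are hypothetical):

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Keep a sleeping background thread alive while the benchmarks run, so
// libstdc++ detects a second thread and never takes the single-threaded
// shared_ptr fast path -- making microbenchmark results representative
// of a multithreaded program.
template <typename Fn>
void run_with_bg_thread(Fn&& benchmarks) {
    std::atomic<bool> done{false};
    std::thread bg([&] {
        while (!done.load(std::memory_order_relaxed))
            std::this_thread::sleep_for(std::chrono::milliseconds(10));
    });
    benchmarks();  // e.g. benchmark::RunSpecifiedBenchmarks()
    done.store(true, std::memory_order_relaxed);
    bg.join();
}
```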

[Bug middle-end/113585] New: Poor codegen turning int compare into -1,0,1

2024-01-24 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113585

Bug ID: 113585
   Summary: Poor codegen turning int compare into -1,0,1
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: middle-end
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

https://www.godbolt.org/z/Y31xG7EeT

These two functions should be equivalent, but comp2() produces better code than
comp1() on both arm64 and x86_64

int comp1(int a, int b) {
return a == b ? 0 : (a < b ? -1 : 1);
}

int comp2(int a, int b) {
return a < b ? -1 : (a > b ? 1 : 0);
}

arm64:
comp1(int, int):
cmp w0, w1
mov w0, -1
csinc   w0, w0, wzr, lt
csel    w0, w0, wzr, ne
ret
comp2(int, int):
cmp w0, w1
cset    w0, gt
csinv   w0, w0, wzr, ge
ret

x86_64:
comp1(int, int):
xor eax, eax
cmp edi, esi
je  .L1
setge   al
movzx   eax, al
lea eax, [rax-1+rax]
.L1:
ret
comp2(int, int):
xor eax, eax
cmp edi, esi
mov edx, -1
setg    al
cmovl   eax, edx
ret

In the arm case, I suspect that the perf will be equivalent, although comp1 has
an extra instruction, so will pay a size penalty. On x86, comp2 is branchless
while comp1 has a branch which could be problematic if not predictable. It
seems like there should be a normalization pass to convert these functions (and
other equivalent ones) into a single normalized representation so that they get
the same codegen.


PS: I tried another equivalent function and it produces even worse codegen on
x86_64, comparing the same registers twice:

int comp3(int a, int b) {
return a > b ? 1 : (a == b ? 0 : -1);
}

comp3(int, int):
xor eax, eax
cmp edi, esi
mov edx, 1
setne   al
neg eax
cmp edi, esi
cmovg   eax, edx
ret
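For completeness, there is a fourth equivalent formulation using the classic (a > b) - (a < b) idiom; in my testing both compilers handle it branch-free from a single cmp, though that is an observation about current codegen rather than a guarantee:

```cpp
// Equivalent -1/0/1 three-way compare: both comparisons feed a
// subtraction instead of a conditional, which typically compiles to
// one cmp plus two setcc and a sub on x86-64.
int comp4(int a, int b) {
    return (a > b) - (a < b);
}
```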

[Bug other/113682] New: Branches in branchless binary search rather than cmov/csel/csinc

2024-01-31 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

Bug ID: 113682
   Summary: Branches in branchless binary search rather than
cmov/csel/csinc
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: other
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

I've been trying to eliminate unpredictable branches in a hot function where
perf counters show a high mispredict percentage. Unfortunately, I haven't been
able to find an incantation to get gcc to generate branchless code other than
inline asm, which I'd rather avoid. In this case I've even laid out the
critical lines so that they exactly match the behavior of the csinc and csel
instructions on arm64, but they are still not used.


Somewhat minimized repro:

typedef unsigned long size_t;
struct ITEM {char* addr; size_t len;};
int cmp(ITEM* user, ITEM* tree);

size_t bsearch2(ITEM* user, ITEM** tree, size_t treeSize) {
auto pos = tree;
size_t low = 0;
size_t high = treeSize;
while (low < high) {
size_t i = (low + high) / 2;
int res = cmp(user, tree[i]);

// These should be cmp + csinc + csel on arm
// and lea + test + cmov + cmov on x86.
low = res > 0 ? i + 1 : low; // csinc
high = res < 0 ? i: high; // csel

if (res == 0) return i;
}
return -1;
}


On arm64 that generates a conditional branch on res > 0:
bl  cmp(ITEM*, ITEM*)
cmp w0, 0
bgt .L15 // does low = i + 1 then loops
mov x20, x19
bne .L4 // loop


On x86_64 it does similar:
call    cmp(ITEM*, ITEM*)
test    eax, eax
jg  .L16 
jne .L17


Note that clang produces the desired codegen for both:

arm:
bl  cmp(ITEM*, ITEM*)
cmp w0, #0
csinc   x23, x23, x22, le
csel    x19, x22, x19, lt
cbnz    w0, .LBB0_1

x86:
call    cmp(ITEM*, ITEM*)@PLT
lea rcx, [r12 + 1]
test    eax, eax
cmovg   r13, rcx
cmovs   rbx, r12
jne .LBB0_1

(full output for all 4 available at https://www.godbolt.org/z/aWrKbYPTG. Code
snippets are from trunk, but this also repros on 13.2)

While ideally gcc would generate the branchless output for the supplied code,
if there is some (reasonable) incantation that would cause it to produce
branchless output, I'd be happy to have that too.

[Bug middle-end/113682] Branches in branchless binary search rather than cmov/csel/csinc

2024-02-01 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113682

--- Comment #5 from Mathias Stearn  ---
(In reply to Andrew Pinski from comment #2)
> I should note that on x86, 2 cmov in a row might be an issue and worse than
> branches. There is a cost model and the x86 backend rejects that.
> 
> There are some cores where it is worse. I don't know if it applies to recent
> ones though.

Do you know if that applies to any cores that support x86_64? I checked Agner
Fog's tables, and only very old cores (P4 era) had high reciprocal
throughput, but even then it was less than latency. It looks like all AMD cores
and intel cores newer than ivy bridge (ie everything from the last 10 years)
are able to execute multiple CMOVs per cycle (reciprocal throughput < 1). From
what I can see, it looks like bad CMOV was a particular problem of the Pentium
4 and Prescott cores, and possibly PPro, but I don't see the numbers for it. I
don't think any of those cores should have an impact on the default cost model
in 2024.

[Bug target/113723] New: -freorder-blocks-and-partition emits bogus asm directives on aarch64

2024-02-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723

Bug ID: 113723
   Summary: -freorder-blocks-and-partition emits bogus asm
directives on aarch64
   Product: gcc
   Version: 14.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: target
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

It tries to subtract labels across the function split which the assembler
rejects. At the very least it does this when generating a switch jump table,
but there may be other cases.

https://www.godbolt.org/z/Ynh1zzY4a

Repro with -O3 -freorder-blocks-and-partition:


[[gnu::cold]] void cold();

template <int> void f();
#define GENERATE_CASE case __COUNTER__: f<__COUNTER__>(); break

void test(int i) {
switch(i) {
case 1: cold(); break;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
GENERATE_CASE;
}
}

test(int):
cmp w0, 18
bls .L17
.L1:
ret
.L17:
adrp    x1, .L4
add x1, x1, :lo12:.L4
ldrb    w1, [x1, w0, uxtw]
adr x0, .Lrtx4
add x1, x0, w1, sxtb #2
br  x1
.Lrtx4:
.L4:
.byte   (.L14 - .Lrtx4) / 4
.byte   (.L13 - .Lrtx4) / 4 //  THIS IS THE BAD ONE
.byte   (.L12 - .Lrtx4) / 4

...

test(int) [clone .cold]:
.L13:
b   cold()


/tmp/cck05x4u.s: Assembler messages:
/tmp/cck05x4u.s:48: Error: invalid operands (.text.unlikely and .text sections)
for `-'
Compiler returned: 1

I've repro'd on 11,12, and trunk.

[Bug target/113723] -freorder-blocks-and-partition emits bogus asm directives on aarch64

2024-02-02 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=113723

--- Comment #1 from Mathias Stearn  ---
As a workaround, sticking [[gnu::optimize("no-reorder-blocks-and-partition")]]
on each function that triggers this seems to work, but that is obviously not a
scalable solution.

[Bug libstdc++/117655] std::string::swap() could be much faster and smaller

2024-11-18 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117655

--- Comment #2 from Mathias Stearn  ---
It looks like a similar optimization would make sense for operator=(string&&):
https://godbolt.org/z/Wo19fjKeK. It might also make sense to just copy the
whole buffer for the local case in operator=(const string&), but the win is
probably less than the rval and swap optimizations.

[Bug libstdc++/117655] New: std::string::swap() could be much faster and smaller

2024-11-18 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117655

Bug ID: 117655
   Summary: std::string::swap() could be much faster and smaller
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

TL;DR: it should swap the object (or value) representations unconditionally,
then do (ideally branchless) fixups for strings that were self-referential. It
could generate 20 lines of branchless x86_64 asm assembling to 43 bytes of
instructions, while today it is 426 very branchy lines and 1424 bytes.

The current implementation branches into 6(!) different implementations based
on whether the source and destination are local (ie self-referential) strings
and based on whether those self-referential strings are empty. Furthermore,
when copying the self-referential strings, it uses a variable-sized copy which
becomes a branch tree on the size. I think that none of that branching is
necessary* and is likely to be slower on modern hardware than just
unconditionally swapping all 32 bytes (or less) of the representation, then
repointing the pointers to be self-referential when they point to the other's
buffer.

This shows the codegen for a hastily written prototype of the optimized swap
under the current implementation: https://godbolt.org/z/jrYdWEEsY

* Caveat: I'm only talking about std::string, in particular with the default
allocator. The current implementation (or something similar) may still be
important for non-default allocators, especially where
propagate_on_container_swap is false. However, that shouldn't prevent
optimizing for the common case of using the default allocator.

[Bug libstdc++/117650] New: __glibcxx_assert_fail should be marked __attribute__((cold))

2024-11-18 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=117650

Bug ID: 117650
   Summary: __glibcxx_assert_fail should be marked
__attribute__((cold))
   Product: gcc
   Version: unknown
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

The cold attribute marks incoming edges much more strongly cold than
__builtin_expect (which IIRC currently assumes only a 90/10 bias). It even
appears to be stronger than __builtin_expect_with_probability when passed
0/100, and is treated closer to how a thrown exception is, e.g. by splitting the
function into hot and cold parts, so that the cold part doesn't waste icache
and iTLB space that could go to another function. Since __glibcxx_assert_fail
should be even less frequent than a thrown exception (ideally never, if
software didn't have all those pesky bugs...), it should be optimized against
at least as strongly. 

https://godbolt.org/z/svjPzrf5z shows the impact on codegen of making
__GLIBCXX_NORETURN imply cold in addition to noreturn. This seemed like the
easiest way to demonstrate this. It also might be worth considering since
noreturn functions tend to be cold ones. We have noreturn imply cold in our
codebase, FWIW.


PS- ABI concerns may make it too late to change this, but it may also be worth
considering having the fail function take a single pointer to a statically
allocated struct with all of the arguments, rather than generating them as
separate arguments. As you can see, there are 4 instructions filling in the
arguments where it could just be 1. Ideally there would be some way to encode
the string addresses such that they don't need runtime relocations at startup,
e.g. by encoding them as the offset from the static struct pointer to the
string, then decoding them only in the fail function.

[Bug c++/119394] New: Add fixit hint suggesting adding () to [] noexcept {} in C++ < 23

2025-03-20 Thread redbeard0531 at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119394

Bug ID: 119394
   Summary: Add fixit hint suggesting adding () to [] noexcept {}
in C++ < 23
   Product: gcc
   Version: 15.0
Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: c++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: redbeard0531 at gmail dot com
  Target Milestone: ---

Note that clang already does this: https://godbolt.org/z/73T38rWM7

void foo(auto){}
void test() {
foo([] noexcept {});
}

No fixit when running `g++ -std=c++20 -Wall -Werror`:

: In function 'void test()':
:3:12: error: parameter declaration before lambda exception
specification only optional with '-std=c++23' or '-std=gnu++23'
[-Werror=c++23-extensions]
3 | foo([] noexcept {});
  |^~~~
cc1plus: all warnings being treated as errors

Clang20 output with fixit:

:3:12: error: lambda without a parameter clause is a C++23 extension
[-Werror,-Wc++23-extensions]
3 | foo([] noexcept {});
  |^
  |() 
1 error generated.

It might also be better to replace the technical-sounding "parameter
declaration before lambda exception specification" with just "() before
noexcept on a lambda", which I think will be easier for the mythical "normal
C++ developer" to understand.