[Bug libstdc++/59439] New: std::locale uses atomic instructions on construction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

            Bug ID: 59439
           Summary: std::locale uses atomic instructions on construction
           Product: gcc
           Version: 4.8.3
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ben.maurer at gmail dot com

In a large multithreaded program that uses stringstream to stringify integers, I have seen that construction and destruction of std::locale can take over 10% of the runtime according to the perf tool.

In bug 40088, std::locale was optimized to avoid the use of a mutex when creating the default locale. In our program, this path is being exercised. However, the reference counting itself still carries a very large performance penalty on SMP systems due to cache line bouncing.

If this locale is truly read-only, it would be much better if refcounting could be avoided. Maybe a refcount of -1 could signify "this object need not be refcounted".
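The "refcount of -1" idea can be illustrated with a minimal sketch. This is not libstdc++'s locale implementation; the names RefCounted, kImmortal, acquire and release are hypothetical and exist only for this example. The point it tries to show is that an object marked with the sentinel is only ever read, so its cache line can stay shared between cores instead of being written by every thread.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical type for illustration only; not libstdc++'s locale::_Impl.
struct RefCounted {
    static constexpr int kImmortal = -1;  // sentinel: never counted, never freed
    std::atomic<int> refs;
    explicit RefCounted(int initial) : refs(initial) {}
};

// Take a reference. For an immortal object this is a plain load, so no
// core needs exclusive ownership of the cache line.
inline void acquire(RefCounted& obj) {
    if (obj.refs.load(std::memory_order_relaxed) == RefCounted::kImmortal)
        return;
    obj.refs.fetch_add(1, std::memory_order_relaxed);
}

// Drop a reference. Returns true when the caller should destroy the object.
// Immortal objects are never destroyed through this path.
inline bool release(RefCounted& obj) {
    if (obj.refs.load(std::memory_order_relaxed) == RefCounted::kImmortal)
        return false;
    return obj.refs.fetch_sub(1, std::memory_order_acq_rel) == 1;
}
```

The sketch assumes the sentinel is set before the object is shared and never changes afterwards, which is why a relaxed load is enough to test for it.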
[Bug libstdc++/59439] std::locale uses atomic instructions on construction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #1 from Ben Maurer ---
Facebook is putting a $50 bounty on this bug via bountysource:

https://www.bountysource.com/issues/1350875
[Bug libstdc++/59439] std::locale uses atomic instructions on construction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #3 from Ben Maurer ---
Created attachment 31405
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31405&action=edit
Benchmark

_build/opt/experimental/bmaurer/benchmark

snprintf
1 threads took 20 ms
2 threads took 20 ms
3 threads took 20 ms
4 threads took 20 ms
5 threads took 22 ms
..
31 threads took 43 ms

iostream
1 threads took 108 ms
2 threads took 219 ms
3 threads took 371 ms
4 threads took 451 ms
5 threads took 559 ms
6 threads took 655 ms
7 threads took 763 ms
8 threads took 908 ms
9 threads took 1015 ms
...
31 threads took 3176 ms
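The attachment itself is not reproduced in this thread, so the following is only a guess at the shape of such a benchmark, not the attached code. The function names (formatWithSnprintf, formatWithStream, timeThreads) are made up for this sketch. It compares stringifying an integer with snprintf against doing so with a freshly constructed std::ostringstream, and times the work across a given number of threads.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <functional>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stringify with snprintf: no std::locale object is constructed.
inline std::string formatWithSnprintf(int value) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%d", value);
    return buf;
}

// Stringify with a fresh ostringstream. Constructing the stream
// default-constructs std::locale objects, each of which takes a
// reference on the shared global locale.
inline std::string formatWithStream(int value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Run `work` `iterations` times on each of `threads` threads and return
// the elapsed wall-clock time in milliseconds.
inline long long timeThreads(int threads, int iterations,
                             const std::function<std::string(int)>& work) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&work, iterations] {
            for (int i = 0; i < iterations; ++i)
                work(i);
        });
    for (auto& th : pool) th.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

Calling timeThreads(n, iterations, formatWithStream) for increasing n, and again with formatWithSnprintf, would give a table comparable in form to the one above. Absolute numbers depend on the machine and the libstdc++ version.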
[Bug libstdc++/59439] std::locale uses atomic instructions on construction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #4 from Ben Maurer ---
Also, here's where perf says time is being spent. While only 25% is shown as being in the locale constructor/destructor, I suspect that the time spent in other methods is actually related to the ping-ponging of cachelines caused by the constructor -- all the time is being spent in memory access which should be very hot in the CPU cache.

 16.77%  benchmark  libstdc++.so.6.0.18  [.] std::locale::locale()
 10.42%  benchmark  libstdc++.so.6.0.18  [.] std::locale::~locale()
  9.93%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet > > >(std::locale const&)
  9.84%  benchmark  libstdc++.so.6.0.18  [.] std::num_put > > const& std::use_facet > > >(std::locale const&)
  9.26%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet > > >(std::locale const&)
  9.10%  benchmark  libstdc++.so.6.0.18  [.] std::num_get > > const& std::use_facet > > >(std::locale const&)
  8.78%  benchmark  libstdc++.so.6.0.18  [.] std::ctype const& std::use_facet >(std::locale const&)
  6.14%  benchmark  libstdc++.so.6.0.18  [.] std::locale::operator=(std::locale const&)
  3.71%  benchmark  libstdc++.so.6.0.18  [.] std::__use_cache >::operator()(std::locale const&) const
  3.66%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet >(std::locale const&)
  3.42%  benchmark  libstdc++.so.6.0.18  [.] __dynamic_cast
  1.96%  benchmark  libstdc++.so.6.0.18  [.] std::locale::id::_M_id() const
  0.89%  benchmark  benchmark            [.] doIoStream()
[Bug libstdc++/59439] std::locale uses atomic instructions on construction
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #6 from Ben Maurer ---
I agree this isn't necessarily the cleanest way to accomplish this task. My discovery of this bug came through a real-life application which was deployed to real users and had a noticeable impact due to this issue. This isn't benchmarking for the sake of benchmarking; an actual developer decided to write code this way. In the context of their code, choosing stringstream made slightly more sense because they were combining multiple integers and strings.

However, bug 40088 suggests that other users may have run into this problem in the past. Given that iostream shows slower, but still reasonable, performance in the single-thread case, it's possible others will run into it in the future. Machines will only be getting more cores, so this problem may increase in impact over time.

In any case, I've fixed the original application, and am filing this bug in the hope that we can save others time debugging bottlenecks like this in the future.
[Bug tree-optimization/78103] New: Failure to optimize with __builtin_clzl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78103

            Bug ID: 78103
           Summary: Failure to optimize with __builtin_clzl
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: tree-optimization
          Assignee: unassigned at gcc dot gnu.org
          Reporter: ben.maurer at gmail dot com
  Target Milestone: ---

constexpr unsigned long findLastSet(unsigned long x) {
  return x ? 8 * sizeof(unsigned long) - __builtin_clzl(x) : 0;
}

constexpr unsigned long findLastSet2(unsigned long x) {
  return x ? ((8 * sizeof(unsigned long) - 1) ^ __builtin_clzl(x)) + 1 : 0;
}

These two functions are the same, but GCC does a better job at compiling the second than the more idiomatic first:

https://godbolt.org/g/B2x5iG

findLastSet(unsigned long):
        xor     eax, eax
        test    rdi, rdi
        je      .L1
        bsr     rdi, rdi
        mov     eax, 64
        xor     rdi, 63
        movsx   rdi, edi
        sub     rax, rdi
.L1:
        rep ret
findLastSet2(unsigned long):
        xor     eax, eax
        test    rdi, rdi
        je      .L6
        bsr     rdi, rdi
        movsx   rax, edi
        add     rax, 1
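The claim that the two functions are the same rests on a small identity: for a nonzero x, __builtin_clzl(x) lies in [0, W-1] where W is the bit width of unsigned long, and when W is a power of two, W-1 is an all-ones mask, so (W-1) - c equals (W-1) ^ c. Hence W - c equals ((W-1) ^ c) + 1. The following sketch spot-checks this; formsAgree is a hypothetical helper written for this example and is not part of the report.

```cpp
#include <cassert>

constexpr unsigned long findLastSet(unsigned long x) {
    return x ? 8 * sizeof(unsigned long) - __builtin_clzl(x) : 0;
}

constexpr unsigned long findLastSet2(unsigned long x) {
    return x ? ((8 * sizeof(unsigned long) - 1) ^ __builtin_clzl(x)) + 1 : 0;
}

// Compile-time spot checks: 6 is 0b110, so its highest set bit is bit 3.
static_assert(findLastSet(6) == 3, "highest set bit of 6 is bit 3");
static_assert(findLastSet2(6) == 3, "both forms agree on 6");

// Hypothetical helper: compare both forms on every power of two and its
// immediate neighbours, which covers each possible clz result.
inline bool formsAgree() {
    const int bits = static_cast<int>(8 * sizeof(unsigned long));
    for (int i = 0; i < bits; ++i) {
        const unsigned long p = 1UL << i;
        const unsigned long samples[] = {p - 1, p, p + 1};
        for (unsigned long x : samples)
            if (findLastSet(x) != findLastSet2(x))
                return false;
    }
    return true;
}
```

This only shows that the two source forms compute the same values on the sampled inputs. It says nothing about which instruction sequence a given GCC version emits for either one.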
[Bug tree-optimization/78103] Failure to optimize with __builtin_clzl
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78103

--- Comment #1 from Ben Maurer ---
Also along the same lines:

https://godbolt.org/g/Nzed5m

GCC figures out the BSR, but there's a cdqe instruction that's unnecessary.
[Bug tree-optimization/82776] Unable to optimize the loop when iteration count is unavailable.
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82776

--- Comment #10 from Ben Maurer ---
This appears to be fixed:

https://godbolt.org/z/r49nYx6df