[Bug libstdc++/59439] New: std::locale uses atomic instructions on construction

2013-12-09 Thread ben.maurer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

Bug ID: 59439
   Summary: std::locale uses atomic instructions on construction
   Product: gcc
   Version: 4.8.3
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: libstdc++
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ben.maurer at gmail dot com

In a large multithreaded program that uses stringstream to stringify integers,
I have seen that construction and destruction of std::locale can take over
10% of the runtime according to the perf tool.

In bug 40088, std::locale was optimized to avoid the usage of a mutex when
creating the default locale. In our program, this path is being exercised.
However, the act of doing ref counting still has a very large performance
penalty on SMP systems due to cache line bouncing.

If this locale is truly readonly, it would be much better if refcounting could
be avoided. Maybe a refcount of -1 could signify "this object need not be
refcounted".
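
A minimal sketch of that idea, assuming a generic intrusive reference count rather than libstdc++'s actual locale internals. The names `RefCounted` and `kImmortal` are hypothetical and exist only for illustration. An object created with the sentinel value is never written to, so taking or dropping a reference only reads the cache line:

```cpp
#include <atomic>

// Hypothetical sentinel: a count of -1 means "never refcounted".
constexpr int kImmortal = -1;

// Illustrative only; this is not libstdc++'s locale implementation.
class RefCounted {
public:
    explicit RefCounted(bool immortal = false)
        : refs_(immortal ? kImmortal : 1) {}

    // Take a reference. Immortal objects need only a relaxed load,
    // so the cache line is read but never modified.
    void add_ref() noexcept {
        if (refs_.load(std::memory_order_relaxed) == kImmortal) return;
        refs_.fetch_add(1, std::memory_order_relaxed);
    }

    // Drop a reference. Returns true when the caller should destroy
    // the object; immortal objects are never destroyed this way.
    bool release() noexcept {
        if (refs_.load(std::memory_order_relaxed) == kImmortal) return false;
        return refs_.fetch_sub(1, std::memory_order_acq_rel) == 1;
    }

    // Current count, exposed for inspection in this sketch.
    int count() const noexcept {
        return refs_.load(std::memory_order_relaxed);
    }

private:
    std::atomic<int> refs_;
};
```

The cost is one extra load and branch on the ordinary path, in exchange for no atomic read-modify-write on the shared, read-only object.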


[Bug libstdc++/59439] std::locale uses atomic instructions on construction

2013-12-09 Thread ben.maurer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #1 from Ben Maurer  ---
Facebook is putting a $50 bounty on this bug via bountysource:
https://www.bountysource.com/issues/1350875


[Bug libstdc++/59439] std::locale uses atomic instructions on construction

2013-12-09 Thread ben.maurer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #3 from Ben Maurer  ---
Created attachment 31405
  --> http://gcc.gnu.org/bugzilla/attachment.cgi?id=31405&action=edit
Benchmark

 _build/opt/experimental/bmaurer/benchmark
snprintf
1 threads took 20 ms
2 threads took 20 ms
3 threads took 20 ms
4 threads took 20 ms
5 threads took 22 ms
..
31 threads took 43 ms
iostream
1 threads took 108 ms
2 threads took 219 ms
3 threads took 371 ms
4 threads took 451 ms
5 threads took 559 ms
6 threads took 655 ms
7 threads took 763 ms
8 threads took 908 ms
9 threads took 1015 ms
...
31 threads took 3176 ms
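
The attachment itself is not reproduced in this thread. The following is a hypothetical sketch of a benchmark of this shape, assuming each thread stringifies integers in a tight loop. The names `format_snprintf`, `format_iostream`, and `time_threads` are invented for illustration and are not taken from the attachment:

```cpp
#include <chrono>
#include <cstdio>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

// Stringify with snprintf: no std::locale object is constructed.
std::string format_snprintf(int value) {
    char buf[32];
    std::snprintf(buf, sizeof buf, "%d", value);
    return buf;
}

// Stringify with a stringstream: constructing the stream constructs
// a std::locale, which is the path this bug is about.
std::string format_iostream(int value) {
    std::ostringstream out;
    out << value;
    return out.str();
}

// Run `work` `iterations` times on each of `nthreads` threads and
// return the elapsed wall-clock time in milliseconds.
template <typename Fn>
long long time_threads(int nthreads, int iterations, Fn work) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t) {
        threads.emplace_back([=] {
            for (int i = 0; i < iterations; ++i) {
                // Store the size in a volatile so the call is not optimized away.
                volatile std::size_t sink = work(i).size();
                (void)sink;
            }
        });
    }
    for (auto& th : threads) th.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}
```

Calling, for example, `time_threads(n, 100000, format_iostream)` for increasing `n` and comparing against `format_snprintf` would show whether per-thread work stays flat or grows with the thread count. The iteration count here is arbitrary.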


[Bug libstdc++/59439] std::locale uses atomic instructions on construction

2013-12-09 Thread ben.maurer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #4 from Ben Maurer  ---
Also, here's where perf says time is being spent. While only 25% is shown as
being in the locale constructor/destructor, I suspect that the time spent in
other methods is actually related to the ping-ponging of cachelines caused by
the constructor -- all the time is being spent in memory access which should be
very hot in the CPU cache.

16.77%  benchmark  libstdc++.so.6.0.18  [.] std::locale::locale()
10.42%  benchmark  libstdc++.so.6.0.18  [.] std::locale::~locale()
 9.93%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet > > >(std::locale const&)
 9.84%  benchmark  libstdc++.so.6.0.18  [.] std::num_put > > const& std::use_facet > > >(std::locale const&)
 9.26%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet > > >(std::locale const&)
 9.10%  benchmark  libstdc++.so.6.0.18  [.] std::num_get > > const& std::use_facet > > >(std::locale const&)
 8.78%  benchmark  libstdc++.so.6.0.18  [.] std::ctype const& std::use_facet >(std::locale const&)
 6.14%  benchmark  libstdc++.so.6.0.18  [.] std::locale::operator=(std::locale const&)
 3.71%  benchmark  libstdc++.so.6.0.18  [.] std::__use_cache >::operator()(std::locale const&) const
 3.66%  benchmark  libstdc++.so.6.0.18  [.] bool std::has_facet >(std::locale const&)
 3.42%  benchmark  libstdc++.so.6.0.18  [.] __dynamic_cast
 1.96%  benchmark  libstdc++.so.6.0.18  [.] std::locale::id::_M_id() const
 0.89%  benchmark  benchmark            [.] doIoStream()


[Bug libstdc++/59439] std::locale uses atomic instructions on construction

2013-12-10 Thread ben.maurer at gmail dot com
http://gcc.gnu.org/bugzilla/show_bug.cgi?id=59439

--- Comment #6 from Ben Maurer  ---
I agree this isn't necessarily the cleanest way to accomplish this task.

My discovery of this bug came through a real-life application which was
deployed to real users and had a noticeable impact due to this issue. This
isn't benchmarking for the sake of benchmarking; an actual developer decided to
write code this way. In the context of their code, choosing stringstream made
slightly more sense because they were combining multiple integers and strings.
However, bug 40088 suggests that other users may have run into this problem in
the past. Given that iostream shows slower, but still reasonable performance in
the single-thread case, it's possible others will run into it in the future.
Machines will only be getting more cores in the future, so this problem may
increase in impact over time.

In any case, I've fixed the original application, and am filing this bug in the
hopes that we can save others debugging time in the future finding bottlenecks
like this.


[Bug tree-optimization/78103] New: Failure to optimize with __builtin_clzl

2016-10-24 Thread ben.maurer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78103

Bug ID: 78103
   Summary: Failure to optimize with __builtin_clzl
   Product: gcc
   Version: unknown
    Status: UNCONFIRMED
  Severity: normal
  Priority: P3
 Component: tree-optimization
  Assignee: unassigned at gcc dot gnu.org
  Reporter: ben.maurer at gmail dot com
  Target Milestone: ---

constexpr
unsigned long findLastSet(unsigned long x) {
  return x ? 8 * sizeof(unsigned long) - __builtin_clzl(x) : 0;
}
constexpr
unsigned long findLastSet2(unsigned long x) {
  return x ? ((8 * sizeof(unsigned long) - 1) ^ __builtin_clzl(x)) + 1 : 0;
}

These two functions compute the same result, but GCC generates better code for
the second than for the more idiomatic first.
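
The equivalence rests on the identity (W-1) ^ c == (W-1) - c for every count c in [0, W-1], where W is the bit width of unsigned long: W-1 is all ones in its low bits, so XOR with it flips those bits, which is the same as subtracting from it. A small compile-time check of that identity, as a sketch (the names `kBits` and `xor_matches_subtract` are hypothetical, and C++14 or later is assumed for the constexpr loop):

```cpp
// Bit width of unsigned long on the target (assumes 8-bit bytes).
constexpr unsigned long kBits = 8 * sizeof(unsigned long);

// Check (kBits - 1) ^ c == (kBits - 1) - c for every possible
// leading-zero count c that __builtin_clzl can return for x != 0.
constexpr bool xor_matches_subtract() {
    for (unsigned long c = 0; c < kBits; ++c) {
        if (((kBits - 1) ^ c) != (kBits - 1 - c)) return false;
    }
    return true;
}

static_assert(xor_matches_subtract(),
              "XOR with kBits-1 equals subtraction from kBits-1");
```

Given that identity, ((W-1) ^ c) + 1 equals W - c, which is what the first function computes directly.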

https://godbolt.org/g/B2x5iG

findLastSet(unsigned long):
        xor     eax, eax
        test    rdi, rdi
        je      .L1
        bsr     rdi, rdi
        mov     eax, 64
        xor     rdi, 63
        movsx   rdi, edi
        sub     rax, rdi
.L1:
        rep ret
findLastSet2(unsigned long):
        xor     eax, eax
        test    rdi, rdi
        je      .L6
        bsr     rdi, rdi
        movsx   rax, edi
        add     rax, 1

[Bug tree-optimization/78103] Failure to optimize with __builtin_clzl

2016-10-24 Thread ben.maurer at gmail dot com
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78103

--- Comment #1 from Ben Maurer  ---
Also along the same lines:

https://godbolt.org/g/Nzed5m

GCC figures out BSRNew, but there's a cdqe instruction that's unnecessary

[Bug tree-optimization/82776] Unable to optimize the loop when iteration count is unavailable.

2024-07-16 Thread ben.maurer at gmail dot com via Gcc-bugs
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82776

--- Comment #10 from Ben Maurer  ---
This appears to be fixed:

https://godbolt.org/z/r49nYx6df