https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942
Bug ID: 96942
Summary: std::pmr::monotonic_buffer_resource causes CPU cache
misses
Product: gcc
Version: unknown
Status: UNCONFIRMED
Severity: normal
Priority: P3
Component: libstdc++
Assignee: unassigned at gcc dot gnu.org
Reporter: dmitriy.ovdienko at gmail dot com
Target Milestone: ---
Created attachment 49183
--> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49183&action=edit
Original implementation
There is a webpage that compares performance of different programming
languages: C++, C, Rust, Java, etc.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/
There is a "binary trees" test there. In this test application creates `perfect
binary tree` and traverses it.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees
The fastest solution for this test is created in Rust.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html
C++ implementation of this problem uses `std::pmr::monotonic_buffer_resource`
class as a memory storage.
https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-4.html
I like C++ very much and I've started an investigation why application compiled
in Rust is faster than C++.
At first, I've run a `perf` tool and had found that application compiled in C++
generates a lot of CPU cache misses (54%):
```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations
./bt_orig 21
Performance counter stats for './bt_orig 21':
45,104,136 cache-references
24,448,475 cache-misses # 54.205 % of all cache
refs
19,904,251,283 cycles
30,462,013,065 instructions # 1.53 insn per cycle
4,834,392,341 branches
234,796 faults
2 migrations
2.083603709 seconds time elapsed
5.559471000 seconds user
0.309529000 seconds sys
```
I thought that it is caused by tree traversing. But after I've modified the
code, I found that a lot of cache misses are caused by
`std::pmr::monotonic_buffer_resource` class, which is used as a memory pool.
I've modified that sample to pre-allocate memory required to hold entire binary
tree instead of grow in geometric progression, but it had made things even
worse.
```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations
./bt_orig_prealloc 21
Performance counter stats for './bt_orig_prealloc 21':
66,400,545 cache-references
45,740,962 cache-misses # 68.886 % of all cache
refs
21,461,610,267 cycles
31,296,637,782 instructions # 1.46 insn per cycle
4,967,611,660 branches
575,100 faults
9 migrations
2.219161594 seconds time elapsed
5.464583000 seconds user
0.854839000 seconds sys
```
That looks really weird and I've implemented my own allocator that behaves like
`std::pmr::monotonic_buffer_resource` and with my memory storage CPU cache
misses are dropped to 34%.
```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations
./bt_malloc 21
Performance counter stats for './bt_malloc 21':
40,713,525 cache-references
14,147,648 cache-misses # 34.749 % of all cache
refs
14,823,743,812 cycles
22,306,442,507 instructions # 1.50 insn per cycle
4,331,968,591 branches
60,227 faults
6 migrations
1.474751692 seconds time elapsed
4.282074000 seconds user
0.092476000 seconds sys
```
Execution time is also dropped from 2.12s to 1.52s (on my laptop).
For completness, following is the report for application compiled in Rust:
```txt
Performance counter stats for './rust/target/release/rust 21':
29,268,774 cache-references
12,147,041 cache-misses # 41.502 % of all cache
refs
24,539,557,585 cycles
31,784,741,964 instructions # 1.30 insn per cycle
4,829,547,556 branches
68,023 faults
8 migrations
2.087679654 seconds time elapsed
7.127183000 seconds user
0.123777000 seconds sys
```
Execution time is 2.09s
Given all above, could you review the implementation of
`std::pmr::monotonic_buffer_resource`. This memory storage has to be very fast.
In fact any attempt to speedup the application makes things slower.
Attached are original C++ implmentation, C++ implementation with preallocated
memory and implementation based on my memory allocator, which uses malloc.
I understand that any class in the `std` namespace has to conform to C++
standard, so it can be hard to fix this issue. Probably G++ compiled can be
tweaked to compile more faster code for `std::pmr::monotonic_buffer_resource`.
Environment:
* CPU: Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
* OS: Ubuntu 20.04.1 LTS
* Command line: g++ --std=c++17 -O3 --omit-frame-pointer -march=ivybridge
-lpthread
* Target: x86_64-linux-gnu
* Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.3.0-10ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic --enable-offload-targets=nvptx-none,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
* Thread model: posix
* gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)