https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942
Bug ID: 96942 Summary: std::pmr::monotonic_buffer_resource causes CPU cache misses Product: gcc Version: unknown Status: UNCONFIRMED Severity: normal Priority: P3 Component: libstdc++ Assignee: unassigned at gcc dot gnu.org Reporter: dmitriy.ovdienko at gmail dot com Target Milestone: --- Created attachment 49183 --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49183&action=edit Original implementation There is a webpage that compares performance of different programming languages: C++, C, Rust, Java, etc. https://benchmarksgame-team.pages.debian.net/benchmarksgame/ There is a "binary trees" test there. In this test application creates `perfect binary tree` and traverses it. https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees The fastest solution for this test is created in Rust. https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html C++ implementation of this problem uses `std::pmr::monotonic_buffer_resource` class as a memory storage. https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-4.html I like C++ very much and I've started an investigation why application compiled in Rust is faster than C++. At first, I've run a `perf` tool and had found that application compiled in C++ generates a lot of CPU cache misses (54%): ```txt root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./bt_orig 21 Performance counter stats for './bt_orig 21': 45,104,136 cache-references 24,448,475 cache-misses # 54.205 % of all cache refs 19,904,251,283 cycles 30,462,013,065 instructions # 1.53 insn per cycle 4,834,392,341 branches 234,796 faults 2 migrations 2.083603709 seconds time elapsed 5.559471000 seconds user 0.309529000 seconds sys ``` I thought that it is caused by tree traversing. But after I've modified the code, I found that a lot of cache misses are caused by `std::pmr::monotonic_buffer_resource` class, which is used as a memory pool. I've modified that sample to pre-allocate memory required to hold entire binary tree instead of grow in geometric progression, but it had made things even worse. ```txt root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./bt_orig_prealloc 21 Performance counter stats for './bt_orig_prealloc 21': 66,400,545 cache-references 45,740,962 cache-misses # 68.886 % of all cache refs 21,461,610,267 cycles 31,296,637,782 instructions # 1.46 insn per cycle 4,967,611,660 branches 575,100 faults 9 migrations 2.219161594 seconds time elapsed 5.464583000 seconds user 0.854839000 seconds sys ``` That looks really weird and I've implemented my own allocator that behaves like `std::pmr::monotonic_buffer_resource` and with my memory storage CPU cache misses are dropped to 34%. ```txt root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e cache-references,cache-misses,cycles,instructions,branches,faults,migrations ./bt_malloc 21 Performance counter stats for './bt_malloc 21': 40,713,525 cache-references 14,147,648 cache-misses # 34.749 % of all cache refs 14,823,743,812 cycles 22,306,442,507 instructions # 1.50 insn per cycle 4,331,968,591 branches 60,227 faults 6 migrations 1.474751692 seconds time elapsed 4.282074000 seconds user 0.092476000 seconds sys ``` Execution time is also dropped from 2.12s to 1.52s (on my laptop). For completness, following is the report for application compiled in Rust: ```txt Performance counter stats for './rust/target/release/rust 21': 29,268,774 cache-references 12,147,041 cache-misses # 41.502 % of all cache refs 24,539,557,585 cycles 31,784,741,964 instructions # 1.30 insn per cycle 4,829,547,556 branches 68,023 faults 8 migrations 2.087679654 seconds time elapsed 7.127183000 seconds user 0.123777000 seconds sys ``` Execution time is 2.09s Given all above, could you review the implementation of `std::pmr::monotonic_buffer_resource`. This memory storage has to be very fast. In fact any attempt to speedup the application makes things slower. Attached are original C++ implmentation, C++ implementation with preallocated memory and implementation based on my memory allocator, which uses malloc. I understand that any class in the `std` namespace has to conform to C++ standard, so it can be hard to fix this issue. Probably G++ compiled can be tweaked to compile more faster code for `std::pmr::monotonic_buffer_resource`. Environment: * CPU: Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz * OS: Ubuntu 20.04.1 LTS * Command line: g++ --std=c++17 -O3 --omit-frame-pointer -march=ivybridge -lpthread * Target: x86_64-linux-gnu * Configured with: ../src/configure -v --with-pkgversion='Ubuntu 9.3.0-10ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs --enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr --with-gcc-major-version-only --program-suffix=-9 --program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id --libexecdir=/usr/lib --without-included-gettext --enable-threads=posix --libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug --enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new --enable-gnu-unique-object --disable-vtable-verify --enable-plugin --enable-default-pie --with-system-zlib --with-target-system-zlib=auto --enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686 --with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib --with-tune=generic --enable-offload-targets=nvptx-none,hsa --without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu --host=x86_64-linux-gnu --target=x86_64-linux-gnu * Thread model: posix * gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)