[Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses

dmitriy.ovdienko at gmail dot com Sat, 05 Sep 2020 02:42:27 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942


            Bug ID: 96942
           Summary: std::pmr::monotonic_buffer_resource causes CPU cache
                    misses
           Product: gcc
           Version: unknown
            Status: UNCONFIRMED
          Severity: normal
          Priority: P3
         Component: libstdc++
          Assignee: unassigned at gcc dot gnu.org
          Reporter: dmitriy.ovdienko at gmail dot com
  Target Milestone: ---

Created attachment 49183
  --> https://gcc.gnu.org/bugzilla/attachment.cgi?id=49183&action=edit
Original implementation

There is a webpage that compares performance of different programming
languages: C++, C, Rust, Java, etc.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/

There is a "binary trees" test there. In this test application creates `perfect
binary tree` and traverses it.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/description/binarytrees.html#binarytrees

The fastest solution for this test is created in Rust. 

https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/binarytrees.html

C++ implementation of this problem uses `std::pmr::monotonic_buffer_resource`
class as a memory storage.

https://benchmarksgame-team.pages.debian.net/benchmarksgame/program/binarytrees-gpp-4.html

I like C++ very much and I've started an investigation why application compiled
in Rust is faster than C++.

At first, I've run a `perf` tool and had found that application compiled in C++
generates a lot of CPU cache misses (54%):

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_orig 21
 Performance counter stats for './bt_orig 21':

        45,104,136      cache-references                                        
        24,448,475      cache-misses              #   54.205 % of all cache
refs    
    19,904,251,283      cycles                                                  
    30,462,013,065      instructions              #    1.53  insn per cycle     
     4,834,392,341      branches                                                
           234,796      faults                                                  
                 2      migrations                                              

       2.083603709 seconds time elapsed

       5.559471000 seconds user
       0.309529000 seconds sys
```

I thought that it is caused by tree traversing. But after I've modified the
code, I found that a lot of cache misses are caused by
`std::pmr::monotonic_buffer_resource` class, which is used as a memory pool.

I've modified that sample to pre-allocate memory required to hold entire binary
tree instead of grow in geometric progression, but it had made things even
worse.

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_orig_prealloc 21
 Performance counter stats for './bt_orig_prealloc 21':

        66,400,545      cache-references                                        
        45,740,962      cache-misses              #   68.886 % of all cache
refs    
    21,461,610,267      cycles                                                  
    31,296,637,782      instructions              #    1.46  insn per cycle     
     4,967,611,660      branches                                                
           575,100      faults                                                  
                 9      migrations                                              

       2.219161594 seconds time elapsed

       5.464583000 seconds user
       0.854839000 seconds sys
```

That looks really weird and I've implemented my own allocator that behaves like
`std::pmr::monotonic_buffer_resource` and with my memory storage CPU cache
misses are dropped to 34%.

```txt
root@E5530:/home/dmytro_ovdiienko/Sources# perf stat -B -e
cache-references,cache-misses,cycles,instructions,branches,faults,migrations 
./bt_malloc 21
 Performance counter stats for './bt_malloc 21':

        40,713,525      cache-references                                        
        14,147,648      cache-misses              #   34.749 % of all cache
refs    
    14,823,743,812      cycles                                                  
    22,306,442,507      instructions              #    1.50  insn per cycle     
     4,331,968,591      branches                                                
            60,227      faults                                                  
                 6      migrations                                              

       1.474751692 seconds time elapsed

       4.282074000 seconds user
       0.092476000 seconds sys
```

Execution time is also dropped from 2.12s to 1.52s (on my laptop).

For completness, following is the report for application compiled in Rust:

```txt
 Performance counter stats for './rust/target/release/rust 21':

        29,268,774      cache-references                                        
        12,147,041      cache-misses              #   41.502 % of all cache
refs    
    24,539,557,585      cycles                                                  
    31,784,741,964      instructions              #    1.30  insn per cycle     
     4,829,547,556      branches                                                
            68,023      faults                                                  
                 8      migrations                                              

       2.087679654 seconds time elapsed

       7.127183000 seconds user
       0.123777000 seconds sys
```

Execution time is 2.09s

Given all above, could you review the implementation of
`std::pmr::monotonic_buffer_resource`. This memory storage has to be very fast.
In fact any attempt to speedup the application makes things slower.

Attached are original C++ implmentation, C++ implementation with preallocated
memory and implementation based on my memory allocator, which uses malloc.

I understand that any class in the `std` namespace has to conform to C++
standard, so it can be hard to fix this issue. Probably G++ compiled can be
tweaked to compile more faster code for `std::pmr::monotonic_buffer_resource`.



Environment:
* CPU: Intel(R) Core(TM) i5-3380M CPU @ 2.90GHz
* OS: Ubuntu 20.04.1 LTS
* Command line: g++ --std=c++17 -O3 --omit-frame-pointer -march=ivybridge
-lpthread
* Target: x86_64-linux-gnu
* Configured with: ../src/configure -v --with-pkgversion='Ubuntu
9.3.0-10ubuntu2' --with-bugurl=file:///usr/share/doc/gcc-9/README.Bugs
--enable-languages=c,ada,c++,go,brig,d,fortran,objc,obj-c++,gm2 --prefix=/usr
--with-gcc-major-version-only --program-suffix=-9
--program-prefix=x86_64-linux-gnu- --enable-shared --enable-linker-build-id
--libexecdir=/usr/lib --without-included-gettext --enable-threads=posix
--libdir=/usr/lib --enable-nls --enable-clocale=gnu --enable-libstdcxx-debug
--enable-libstdcxx-time=yes --with-default-libstdcxx-abi=new
--enable-gnu-unique-object --disable-vtable-verify --enable-plugin
--enable-default-pie --with-system-zlib --with-target-system-zlib=auto
--enable-objc-gc=auto --enable-multiarch --disable-werror --with-arch-32=i686
--with-abi=m64 --with-multilib-list=m32,m64,mx32 --enable-multilib
--with-tune=generic --enable-offload-targets=nvptx-none,hsa
--without-cuda-driver --enable-checking=release --build=x86_64-linux-gnu
--host=x86_64-linux-gnu --target=x86_64-linux-gnu
* Thread model: posix
* gcc version 9.3.0 (Ubuntu 9.3.0-10ubuntu2)

[Bug libstdc++/96942] New: std::pmr::monotonic_buffer_resource causes CPU cache misses

Reply via email to