https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
The most pronounced difference for depth=18 seems to be caused by
monotonic_buffer_resource (m_b_r) over-allocating by 2x: internally it
mallocs twice the size given to the constructor, and then Linux pre-faults
those extra pages, penalizing the benchmark.
Dividing the estimated size by 2 to counter the over-allocation effect:

    MemoryPool store (poolSize(stretch_depth) / 2);

substantially improves the benchmark for me.
I think the rest of the slowdown can be attributed to m_b_r simply doing more
work internally than your bare-bones malloc allocator (I'm seeing less
pronounced differences, though; I'm testing on a Sandy Bridge CPU with -O2).
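
For anyone who wants to check the over-allocation on their own libstdc++
build, one possible approach is to hand m_b_r an upstream resource that
records what is requested from it. This is only a sketch: LoggingResource and
upstream_requests are hypothetical names, not part of the benchmark, and the
sizes recorded will depend on the library version.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// Hypothetical helper: records every upstream request, then forwards it
// to the default new/delete resource.
class LoggingResource : public std::pmr::memory_resource {
public:
    std::vector<std::size_t> requests;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        requests.push_back(bytes);
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Hypothetical helper: builds a monotonic_buffer_resource with the given
// initial size, makes one small allocation from it, and returns the sizes
// that the pool requested from its upstream resource.
std::vector<std::size_t> upstream_requests(std::size_t initial_size) {
    assert(initial_size >= 2);
    LoggingResource upstream;
    {
        std::pmr::monotonic_buffer_resource pool(initial_size, &upstream);
        void* p = pool.allocate(initial_size / 2);
        (void)p;  // only the upstream request size is of interest here
    }  // pool is destroyed first and returns its buffer to upstream
    return upstream.requests;
}
```

Comparing the first element of upstream_requests(n) against n shows how much
more than the constructor argument is actually requested on a given setup.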