https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96942
--- Comment #9 from Alexander Monakov <amonakov at gcc dot gnu.org> ---
The most pronounced difference for depth=18 seems to be caused by m_b_r over-allocating by 2x: internally it mallocs 2x of the size given to the constructor, and then Linux pre-faults those extra pages, penalizing the benchmark.

Dividing estimated size by 2 to counter the over-allocation effect:

    MemoryPool store (poolSize(stretch_depth) / 2);

substantially improves the benchmark for me.

I think the rest of the slowdown can be attributed to m_b_r simply doing more work internally compared to your bare-bones malloc allocator (I'm seeing less pronounced differences though, I'm testing on a Sandybridge CPU with -O2).
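One way to check the over-allocation on a given libstdc++ build is to put a recording upstream resource behind the pool and look at the size of the first request it receives. The following is only a rough sketch, assuming m_b_r here means std::pmr::monotonic_buffer_resource; RecordingResource and first_upstream_request are hypothetical names made up for illustration and are not part of the benchmark in this report. The ratio observed will depend on the library version, so no particular value is claimed.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>

// Hypothetical helper (not from the benchmark): an upstream resource that
// records the size of the first request it sees, then forwards the request
// to new_delete_resource().
class RecordingResource : public std::pmr::memory_resource {
public:
    std::size_t first_request = 0;
    std::size_t request_count = 0;

private:
    void* do_allocate(std::size_t bytes, std::size_t align) override {
        if (request_count++ == 0)
            first_request = bytes;
        return std::pmr::new_delete_resource()->allocate(bytes, align);
    }

    void do_deallocate(void* p, std::size_t bytes, std::size_t align) override {
        std::pmr::new_delete_resource()->deallocate(p, bytes, align);
    }

    bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
        return this == &other;
    }
};

// Returns the number of bytes the monotonic resource asks its upstream for
// when it is constructed with `initial_size` and then serves one tiny
// allocation. Comparing the result with `initial_size` shows the
// over-allocation factor for the library in use.
std::size_t first_upstream_request(std::size_t initial_size) {
    RecordingResource upstream;  // declared first so it outlives the pool
    std::size_t requested = 0;
    {
        std::pmr::monotonic_buffer_resource pool(initial_size, &upstream);
        void* p = pool.allocate(1, 1);  // forces the first upstream request
        assert(p != nullptr);
        (void)p;
        requested = upstream.first_request;
    }  // pool hands its buffer back to upstream here
    return requested;
}
```

Calling first_upstream_request(n) with the value that poolSize(stretch_depth) produces and dividing the result by n gives the factor in question; if it comes out close to 2, halving the size passed to the constructor should roughly cancel it.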