https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114563
--- Comment #11 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Andi Kleen from comment #10)
> It doesn't really help for the PR119387 test case, perhaps not surprising
> because it optimizes freeing not allocation:
>
> Summary
>   ./gcc/cc1plus-opt -w -std=c++20 ~/gcc/git/tsrc/119387-formatted.ii -quiet ran
>   1.02 ± 0.03 times faster than
>   ./gcc/cc1plus -w -std=c++20 ~/gcc/git/tsrc/119387-formatted.ii -quiet
>
> So still need a real test case.

Hmm.

@@ -782,8 +836,10 @@ alloc_page (unsigned order)
   entry = NULL;
   page = NULL;

+  free_list = find_free_list_order (order, entry_size);
+
   /* Check the list of free pages for one we can use.  */
-  for (pp = &G.free_pages, p = *pp; p; pp = &p->next, p = *pp)
+  for (pp = &free_list->free_pages, p = *pp; p; pp = &p->next, p = *pp)
     if (p->bytes == entry_size)
       break;

so my idea was to have multiple freelists so that p->bytes == entry_size
holds for every entry, and this list walk, which I think is the bottleneck
for PR119387, is improved.

I'm testing the PR119387 testcase again with -O2 -g -std=c++20 -fno-checking;
it is the above loop that results in the following in a cc1plus build
with -O0:

Samples: 2M of event 'cycles:Pu', Event count (approx.): 2787093272529
  61.29%  1373199  cc1plus  cc1plus  [.] alloc_page

Using your patch this changes to

Samples: 1M of event 'cycles:Pu', Event count (approx.): 1053172130606
   0.02%      234  cc1plus  cc1plus  [.] alloc_page(unsigned int)

so the patch works as intended!

Btw, I think you can avoid adding the freelist pointer to page_entry by
using the O(1) lookup from page_entry->order & page_entry->bytes?  The
struct's size doesn't seem very optimized, so I'm not sure how important
that is.