https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108487
--- Comment #2 from MARK BOURGEAULT <Mark_B53 at yahoo dot com> ---
>> For fn1, assembly of the inner loop should be identical, so I think the 20%
>> you were seeing may result from different loop alignment with respect to 32b
>> fetch boundary

Yes, it does appear that this is the explanation for the difference. Here are
the full results:

original code
* gcc 10.3 -std=c++20 -O3                  => fn1 = ~2000ms, fn2 = ~1000ms
* gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms
* gcc 12.2 -std=c++20 -O3                  => fn1 = ~2500ms, fn2 = ~32000ms
* gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms

fn1 only
* gcc 10.3 -std=c++20 -O3                  => fn1 = ~2500ms
* gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
* gcc 12.2 -std=c++20 -O3                  => fn1 = ~2000ms
* gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms

>> Also please note that cloud instances backing godbolt.org have different
>> CPUs, so timing results from different runs are not directly comparable.

Yes, I know. I really only used godbolt to reach the conclusion that the issue
still exists on trunk.