https://gcc.gnu.org/bugzilla/show_bug.cgi?id=108487

--- Comment #2 from MARK BOURGEAULT <Mark_B53 at yahoo dot com> ---
>> For fn1, assembly of the inner loop should be identical, so I think the 20% 
>> you were seeing may result from different loop alignment with respect to 32b 
>> fetch boundary
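Yes, loop alignment does appear to explain the fn1 difference: with
-falign-loops=32, fn1 takes ~2000ms with both compilers.  The fn2 slowdown is
unaffected by the flag.  Here are the full results: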

original code
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~1000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2500ms, fn2 = ~32000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms, fn2 = ~32000ms

fn1 only
 * gcc 10.3 -std=c++20 -O3 => fn1 = ~2500ms
 * gcc 10.3 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 => fn1 = ~2000ms
 * gcc 12.2 -std=c++20 -O3 -falign-loops=32 => fn1 = ~2000ms
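
For anyone who wants to repeat this kind of comparison locally, here is a
minimal sketch of a timing harness.  It is not the code from this report:
placeholder_workload is a hypothetical stand-in for fn1/fn2, and time_ms is an
illustrative helper name.

```cpp
#include <chrono>
#include <cstdint>
#include <utility>

// Hypothetical stand-in for the function under test (NOT fn1/fn2 from the
// report): sums the squares of 0..n-1.
std::uint64_t placeholder_workload(std::uint64_t n) {
    std::uint64_t sum = 0;
    for (std::uint64_t i = 0; i < n; ++i) {
        sum += i * i;
    }
    return sum;
}

// Illustrative helper: runs f once and returns the elapsed wall-clock time
// in milliseconds, measured with a monotonic clock.
template <typename F>
double time_ms(F&& f) {
    const auto start = std::chrono::steady_clock::now();
    std::forward<F>(f)();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms from main with a lambda that stores the workload's result into
a volatile variable (so the optimizer cannot drop the call), then building the
same source once with -std=c++20 -O3 and once with -std=c++20 -O3
-falign-loops=32, gives the kind of side-by-side numbers listed above.  Timings
from a single run are noisy, so repeating each measurement several times is
advisable.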

>> Also please note that cloud instances backing godbolt.org have different 
>> CPUs, so timing results from different runs are not directly comparable.
Yes, I know.  I only used godbolt to confirm that the issue still exists on
trunk.
