https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860
--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.

perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):
20.03%  pov::All_CSG_Intersect_Intersections
16.42%  pov::All_Plane_Intersections
10.29%  pov::All_Sphere_Intersections
10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:
19.51%  pov::All_CSG_Intersect_Intersections
16.91%  pov::All_Plane_Intersections
12.53%  pov::All_Sphere_Intersections
 9.81%  pov::Intersect_BBox_Tree

I verified there are no code-gen differences for any of the above hot functions.

Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes for the text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections, which seems to be caused by the following adrp instruction (with no code-gen changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44: adrp x0, 4e0000 <pov::SetCommandOption(POVMSData*, unsigned int, pov::shelldata*) [clone .isra.0]+0x1c0>
  1.77 │4aec48: ldr d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4: adrp x0, 4e0000 <pov::Warning(unsigned int, char const*, ...) [clone .constprop.0]+0x100>
  1.27 │4aeae8: ldr d28, [x0, #2432]

This seems to come from the following condition in Intersect_Sphere (which gets inlined into All_Sphere_Intersections):
if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))

As far as I can see, there’s no difference between the two adrp instructions except their address (4aec44 vs 4aeae4). And as far as I know, adrp only calculates a pc-relative page address (and does not load any data).

To check for possible icache misses I used the L1I_CACHE_REFILL counter, and it turns out that there are 64% more L1 icache misses for the above adrp instruction with a2f4be3dae0 compared to 82d6d385f97, which may (partially) explain the performance difference?
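For illustration only (this is not part of the original measurements), the address arithmetic behind the two adrp instructions can be sketched as below. The sketch shows that both instructions encode the same page delta and differ only in where they sit in memory. The function names are hypothetical, and the 64-byte L1 icache line size is an assumption about the target core, not something taken from the profile.

```python
PAGE_SIZE = 4096   # adrp works at 4 KiB page granularity
CACHE_LINE = 64    # ASSUMPTION: 64-byte L1I cache line on the target core


def adrp_target(pc, imm_pages):
    """Value adrp writes to its register: page of pc plus imm_pages pages."""
    return (pc & ~(PAGE_SIZE - 1)) + imm_pages * PAGE_SIZE


def adrp_imm_pages(pc, target):
    """Page delta that adrp must encode to reach target from pc."""
    return (target - (pc & ~(PAGE_SIZE - 1))) // PAGE_SIZE


def cache_line_offset(addr):
    """Byte offset of addr within its (assumed) cache line."""
    return addr % CACHE_LINE


# Addresses and ldr offsets are taken from the perf annotate output above.
builds = (
    ("82d6d385f97", 0x4AEC44, 2784),
    ("a2f4be3dae0", 0x4AEAE4, 2432),
)

for commit, pc, ldr_offset in builds:
    imm = adrp_imm_pages(pc, 0x4E0000)
    base = adrp_target(pc, imm)
    print(
        f"{commit}: adrp at {pc:#x}, page delta {imm}, "
        f"base {base:#x}, ldr reads {base + ldr_offset:#x}, "
        f"adrp offset in cache line {cache_line_offset(pc)}"
    )
```

Under these assumptions the computed result of adrp is identical in both builds, so any difference would have to come from instruction placement (fetch or alignment effects) rather than from what the instruction computes.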
However, perf stat shows around 7% more L1 icache misses over the whole program run with 82d6d385f97 than with a2f4be3dae0.

I could (repeatedly) reproduce the issue on two neoverse-v2 machines. The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays -flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh