[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa

prathamesh3492 at gcc dot gnu.org via Gcc-bugs Fri, 03 May 2024 03:45:14 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114860


--- Comment #4 from prathamesh3492 at gcc dot gnu.org ---
Hi Tamar,
Sorry for the late response.

perf profile for povray with LTO:

Compiled with 82d6d385f97 (commit before a2f4be3dae0):
  20.03%  pov::All_CSG_Intersect_Intersections
  16.42%  pov::All_Plane_Intersections
  10.29%  pov::All_Sphere_Intersections
  10.10%  pov::Intersect_BBox_Tree

Compiled with a2f4be3dae0:
  19.51%  pov::All_CSG_Intersect_Intersections
  16.91%  pov::All_Plane_Intersections
  12.53%  pov::All_Sphere_Intersections
   9.81%  pov::Intersect_BBox_Tree
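
To make the shift between the two profiles easier to read, here is a small Python sketch that computes the per-function change in sample share. The numbers are copied from the two profiles above; the helper name share_deltas is hypothetical. Note that each share is relative to its own run, so the deltas are percentage points of samples, not absolute time.

```python
# Sample shares (percent of samples) copied from the two perf profiles.
before = {
    "pov::All_CSG_Intersect_Intersections": 20.03,
    "pov::All_Plane_Intersections": 16.42,
    "pov::All_Sphere_Intersections": 10.29,
    "pov::Intersect_BBox_Tree": 10.10,
}
after = {
    "pov::All_CSG_Intersect_Intersections": 19.51,
    "pov::All_Plane_Intersections": 16.91,
    "pov::All_Sphere_Intersections": 12.53,
    "pov::Intersect_BBox_Tree": 9.81,
}


def share_deltas(old, new):
    """Return {function: new share - old share} in percentage points."""
    return {name: round(new[name] - old[name], 2) for name in old}


deltas = share_deltas(before, after)

# Print the functions ordered by the size of the change, largest first.
for name, delta in sorted(deltas.items(), key=lambda kv: -abs(kv[1])):
    print(f"{delta:+6.2f}  {name}")
```

The largest movement is in pov::All_Sphere_Intersections, which is the function examined further below.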

I verified there are no code-gen differences for any of the above hot
functions.
Running size on povray_r_exe.out shows a slight code-size decrease of 344 bytes
for the text section:
Compiled with 82d6d385f97: 1101505
Compiled with a2f4be3dae0: 1101161

Curiously, there’s a meaningful difference for pov::All_Sphere_Intersections,
which seems to be caused by the following adrp instruction (with no code-gen
changes in All_Sphere_Intersections):

Compiled with 82d6d385f97:
 18.07 │4aec44:   adrp  x0, 4e0000 <pov::SetCommandOption(POVMSData*, unsigned int, pov::shelldata*) [clone .isra.0]+0x1c0>
  1.77 │4aec48:   ldr   d28, [x0, #2784]

Compiled with a2f4be3dae0:
 28.93 │4aeae4:   adrp  x0, 4e0000 <pov::Warning(unsigned int, char const*, ...) [clone .constprop.0]+0x100>
  1.27 │4aeae8:   ldr   d28, [x0, #2432]

This seems to come from the following condition in Intersect_Sphere (which gets
inlined into All_Sphere_Intersections):

if ((OCSquared >= Radius2) && (t_Closest_Approach < EPSILON))
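
For readers unfamiliar with this test, the following is a rough Python sketch of a ray/sphere intersection routine built around the same early-out. It is not the POV-Ray source (which is C++): the function signature, the EPSILON value, and the use of plain tuples for vectors are all assumptions made for illustration, and the ray direction is assumed to be unit length. In compiled code the comparison against EPSILON needs a floating-point constant, which is presumably what the adrp/ldr pair above materializes.

```python
import math

EPSILON = 1.0e-10  # assumed value, for illustration only


def intersect_sphere(origin, direction, center, radius2):
    """Return (depth1, depth2) along the ray, or None if the ray misses.

    `direction` is assumed to be normalized; `radius2` is the squared radius.
    """
    # Vector from the ray origin to the sphere center.
    oc = [c - o for c, o in zip(center, origin)]
    oc_squared = sum(v * v for v in oc)
    # Distance along the ray to the point closest to the center.
    t_closest_approach = sum(a * b for a, b in zip(oc, direction))

    # Early-out quoted in the bug report: the origin is not inside the
    # sphere and the center lies behind (or level with) the origin.
    if oc_squared >= radius2 and t_closest_approach < EPSILON:
        return None

    t_half_chord_squared = radius2 - oc_squared + t_closest_approach ** 2
    if t_half_chord_squared > EPSILON:
        half_chord = math.sqrt(t_half_chord_squared)
        return (t_closest_approach - half_chord,
                t_closest_approach + half_chord)
    return None


hit = intersect_sphere((0, 0, 0), (0, 0, 1), (0, 0, 5), 1.0)
behind = intersect_sphere((0, 0, 0), (0, 0, 1), (0, 0, -5), 1.0)
miss = intersect_sphere((0, 0, 0), (0, 0, 1), (0, 3, 5), 1.0)
```

Here `hit` is a sphere straight ahead of the ray, `behind` triggers the early-out, and `miss` passes the early-out but fails the half-chord test.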

As far as I can see, there’s no difference between the two adrp instructions except
their addresses (4aec44 vs 4aeae4). And as far as I know, adrp only computes a
pc-relative page address (and does not load any data). To check for possible
icache misses I used the L1I_CACHE_REFILL counter, and it turns out that there are
64% more L1 icache misses for the above adrp instruction with a2f4be3dae0 than with
82d6d385f97, which may (partially) explain the performance difference?
However, perf stat shows around 7% more L1 icache misses for the whole
program run with 82d6d385f97 than with a2f4be3dae0.
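
As a sanity check on the address arithmetic, here is a minimal Python sketch of what adrp computes for the two instruction addresses in the disassembly above, plus where each instruction sits within an instruction-cache line. The 64-byte line size is an assumption about the target, and the helper names are hypothetical; the addresses and ldr offsets are taken from the perf annotate output.

```python
PAGE_SIZE = 4096   # adrp operates on 4 KiB pages
CACHE_LINE = 64    # assumed L1 icache line size, for illustration only


def adrp_page_delta(pc, target_page):
    """Signed page count that adrp must encode to reach target_page from pc."""
    return (target_page >> 12) - (pc >> 12)


def adrp_result(pc, page_delta):
    """Value adrp writes: the page of pc plus page_delta pages."""
    return (pc & ~(PAGE_SIZE - 1)) + page_delta * PAGE_SIZE


def describe(pc, target_page, ldr_offset):
    delta = adrp_page_delta(pc, target_page)
    base = adrp_result(pc, delta)
    return {
        "page_delta": delta,                 # immediate encoded in adrp
        "load_address": base + ldr_offset,   # address read by the ldr
        "line_offset": pc % CACHE_LINE,      # position of adrp in its line
    }


old = describe(0x4AEC44, 0x4E0000, 2784)  # built with 82d6d385f97
new = describe(0x4AEAE4, 0x4E0000, 2432)  # built with a2f4be3dae0

for label, info in (("82d6d385f97", old), ("a2f4be3dae0", new)):
    print(label, info["page_delta"], hex(info["load_address"]),
          info["line_offset"])
```

Under these assumptions both adrp instructions encode the same page delta and neither sits at the start of a cache line, so the arithmetic alone does not single out one build; only the placement of the code and of the loaded constant differs.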

I could (repeatedly) reproduce the issue on two neoverse-v2 machines.
The full command line passed to the compiler was:
"-O3 -Wl,-z,muldefs -lm -fallow-argument-mismatch -fpermissive -fstack-arrays
-flto -march=native -mcpu=neoverse-v2"

Thanks,
Prathamesh

[Bug target/114860] [14/15 regression] [aarch64] 511.povray regresses by ~5.5% with -O3 -flto -march=native -mcpu=neoverse-v2 since r14-10014-ga2f4be3dae04fa
