[Bug tree-optimization/101908] [12 regression] cray regression with -O2 -ftree-slp-vectorize compared to -O2

rguenth at gcc dot gnu.org via Gcc-bugs Thu, 28 Oct 2021 06:02:52 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908


Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-10-28

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with -O3/-Ofast cray is _much_ faster, despite the vectorization (we see
even more of it there).  At -O2 we get

c-ray-f.c:370:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:378:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:450:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:442:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:421:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:490:3: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:273:10: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:553:18: optimized: basic block part vectorized using 16 byte vectors

perf data: .v is vectorized, .nv not:

Samples: 53K of event 'cycles', Event count (approx.): 53285786739              
Overhead       Samples  Command     Shared Object     Symbol                    
  62.31%         32941  c-ray-f.v   c-ray-f.v         [.] ray_sphere
  31.13%         16666  c-ray-f.nv  c-ray-f.nv        [.] ray_sphere

and _likely_ the issue is that we have

       |     int ray_sphere(const struct sphere *sph, struct ray ray, struct sp
       |       sub      $0x98,%rsp
       |     double a, b, c, d, sqrt_d, t1, t2;
       |
       |     a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
       |     b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
       |       movupd   (%rdi),%xmm5
       |     2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
       |     2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
  0.02 |       movsd    0x10(%rdi),%xmm9
  0.01 |       movupd   0xb8(%rsp),%xmm13
 37.67 |       movupd   0xa0(%rsp),%xmm15

so we pass struct ray on the stack(?) and perform SSE loads from it but
the argument passing does

  0.88 |       movups %xmm2,(%rsp)
  0.22 |       movups %xmm3,0x10(%rsp)
 43.81 |       movups %xmm4,0x20(%rsp)
  0.66 |       call   ray_sphere

IIRC Zen2 had some 'tricks' to forward stack spill/restore and if that
fails for some reason there is probably a penalty - at least in this case
it shouldn't be STLF (store-to-load forwarding).  Note that the
non-vectorized code has the same code on the caller side but loads scalar
pieces.
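
As a rough aid for reading the two listings, the stack offsets can be lined
up by hand.  The C sketch below is not from the bug report: the helper names
caller_offset and within_one_store are made up for illustration, and it
assumes that the only things between the caller's stores and the callee's
loads are the 8-byte return address pushed by `call` and the callee's
`sub $0x98,%rsp`.  It only compares address ranges; it does not model what
Zen2 actually does with them.

```c
#include <assert.h>
#include <stddef.h>

/* Translate an offset from the callee's %rsp (after `sub $frame,%rsp`)
   into an offset from the caller's %rsp at the call instruction.
   Assumption: only the callee frame and the 8-byte return address
   lie between the two stack pointers.  */
static size_t caller_offset(size_t callee_off, size_t frame)
{
    assert(callee_off >= frame + 8);
    return callee_off - frame - 8;
}

/* Return 1 if the load [off, off + load_sz) lies entirely inside one
   store, where stores of store_sz bytes start at offsets 0, store_sz,
   2 * store_sz, ... (as the three movups stores in the caller do).
   Return 0 if the load spans two or more of those stores.  */
static int within_one_store(size_t off, size_t load_sz, size_t store_sz)
{
    assert(load_sz > 0 && store_sz > 0);
    return off / store_sz == (off + load_sz - 1) / store_sz;
}
```

With the values from the listings, caller_offset(0xa0, 0x98) is 0 and
caller_offset(0xb8, 0x98) is 0x18.  within_one_store(0, 16, 16) is 1 while
within_one_store(0x18, 16, 16) is 0, so under these assumptions the 16-byte
load from 0xb8(%rsp) would span the stores to 0x10(%rsp) and 0x20(%rsp),
whereas the load from 0xa0(%rsp) matches the store to (%rsp) exactly.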

Not inlining ray_sphere at -O2 is of course what makes it overall slow.

Confirmed on Zen2.
