https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-10-28

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with -O3/-Ofast cray is _much_ faster, despite the vectorization (we see
even more there).  At -O2 we get

c-ray-f.c:370:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:378:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:450:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:442:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:421:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:490:3: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:273:10: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:553:18: optimized: basic block part vectorized using 16 byte vectors

perf data: .v is vectorized, .nv not:

Samples: 53K of event 'cycles', Event count (approx.): 53285786739
Overhead       Samples  Command     Shared Object  Symbol
  62.31%         32941  c-ray-f.v   c-ray-f.v      [.] ray_sphere
  31.13%         16666  c-ray-f.nv  c-ray-f.nv     [.] ray_sphere

and _likely_ the issue is that we have

       |    int ray_sphere(const struct sphere *sph, struct ray ray, struct sp#
       |      sub    $0x98,%rsp #
       |    double a, b, c, d, sqrt_d, t1, t2; #
       | #
       |    a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z); #
       |    b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) + #
       |      movupd (%rdi),%xmm5 #
       |    2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) + #
       |    2.0 * ray.dir.z * (ray.orig.z - sph->pos.z); #
  0.02 |      movsd  0x10(%rdi),%xmm9 #
  0.01 |      movupd 0xb8(%rsp),%xmm13 #
 37.67 |      movupd 0xa0(%rsp),%xmm15

so we pass struct ray on the stack(?) and perform SSE loads from it but the
argument passing does

  0.88 |      movups %xmm2,(%rsp) #
  0.22 |      movups %xmm3,0x10(%rsp) #
 43.81 |      movups %xmm4,0x20(%rsp) #
  0.66 |      call   ray_sphere

IIRC Zen2 had some 'tricks' to forward stack spill/restore and if that fails
for some reason there is probably a penalty - at least in this case it
shouldn't be STLF.

Note the not vectorized code has the same code on the caller side but loads
scalar pieces.  Not inlining ray_sphere at -O2 is of course what makes it
overall slow.

Confirmed on Zen2.
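
For anyone wanting to look at the argument-passing pattern in isolation, here is a minimal sketch, not the c-ray source. The struct layouts are assumptions inferred from the member names visible in the annotated listing (`ray.orig.x`, `ray.dir.x`, `sph->pos.x`); the `rad` member, the function name `ray_sphere_b`, and the reduced body (only the `b` term) are hypothetical. `__attribute__((noinline))` is a GCC/Clang extension used here to mimic the non-inlined -O2 case, so that the by-value `struct ray` has to be written to the stack by the caller and read back by the callee.

```c
/* Assumed layouts; the real c-ray structs may have further members. */
struct vec3   { double x, y, z; };
struct ray    { struct vec3 orig, dir; };   /* 48 bytes under this assumption */
struct sphere { struct vec3 pos; double rad; };

/* Keep the callee out of line so the struct argument is passed in memory. */
__attribute__((noinline))
double ray_sphere_b(const struct sphere *sph, struct ray ray)
{
    /* Same expression as the 'b' term in the annotated source above. */
    return 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
           2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
           2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
}
```

Compiling this with and without basic-block vectorization at -O2 and comparing the loads in the callee should show whether the same 16-byte stack loads appear. Under the layout assumed here, and taking the `call` plus `sub $0x98,%rsp` as the only stack adjustments, the caller's stores at 0x0, 0x10 and 0x20 would land at callee offsets 0xa0, 0xb0 and 0xc0, so the 16-byte load at 0xb8(%rsp) would span two of those stores; whether that accounts for the penalty seen on Zen2 is not established by this sketch.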