https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101908
Richard Biener <rguenth at gcc dot gnu.org> changed:
           What    |Removed                     |Added
----------------------------------------------------------------------------
   Target Milestone|---                         |12.0
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2021-10-28
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
Btw, with -O3/-Ofast c-ray is _much_ faster, despite the vectorization (we see
even more of it there). At -O2 we get
c-ray-f.c:370:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:378:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:450:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:442:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:421:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:490:3: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:360:9: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:273:10: optimized: basic block part vectorized using 16 byte vectors
c-ray-f.c:553:18: optimized: basic block part vectorized using 16 byte vectors
perf data (.v is the vectorized binary, .nv the non-vectorized one):
Samples: 53K of event 'cycles', Event count (approx.): 53285786739
Overhead  Samples  Command     Shared Object  Symbol
  62.31%    32941  c-ray-f.v   c-ray-f.v      [.] ray_sphere
  31.13%    16666  c-ray-f.nv  c-ray-f.nv     [.] ray_sphere
and _likely_ the issue is that we have
       | int ray_sphere(const struct sphere *sph, struct ray ray, struct sp
       |   sub    $0x98,%rsp
       | double a, b, c, d, sqrt_d, t1, t2;
       |
       | a = SQ(ray.dir.x) + SQ(ray.dir.y) + SQ(ray.dir.z);
       | b = 2.0 * ray.dir.x * (ray.orig.x - sph->pos.x) +
       |   movupd (%rdi),%xmm5
       | 2.0 * ray.dir.y * (ray.orig.y - sph->pos.y) +
       | 2.0 * ray.dir.z * (ray.orig.z - sph->pos.z);
  0.02 |   movsd  0x10(%rdi),%xmm9
  0.01 |   movupd 0xb8(%rsp),%xmm13
 37.67 |   movupd 0xa0(%rsp),%xmm15
so we pass struct ray on the stack(?) and perform SSE loads from it but
the argument passing does
  0.88 |   movups %xmm2,(%rsp)
  0.22 |   movups %xmm3,0x10(%rsp)
 43.81 |   movups %xmm4,0x20(%rsp)
  0.66 |   call   ray_sphere
IIRC Zen2 had some 'tricks' to forward stack spill/restore and if that
fails for some reason there is probably a penalty - at least in this case
it shouldn't be STLF (store-to-load forwarding). Note the non-vectorized
code has the same code on the caller side but loads scalar pieces.
Not inlining ray_sphere at -O2 is of course what makes it overall slow.
Confirmed on Zen2.