https://gcc.gnu.org/bugzilla/show_bug.cgi?id=114269
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
good (base) vs. bad (peak) on Zen2 with -Ofast -march=native shows
Samples: 654K of event 'cycles', Event count (approx.): 743149709374
Overhead  Samples  Command          Shared Object               Symbol
  16.71%   109793  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] hsmoc_
  14.37%    94016  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] hsmoc_
   8.82%    57979  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] lorentz_
   8.48%    55451  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] lorentz_
   4.84%    31575  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] momx3_
   4.68%    30456  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] momx3_
   4.08%    26675  zeusmp_peak.amd  zeusmp_peak.amd64-m64-mine  [.] tranx3_
   3.56%    23145  zeusmp_base.amd  zeusmp_base.amd64-m64-mine  [.] tranx3_
For hsmoc_ it looks like a difference in the transformations done:
-hsmoc.f:826:19: optimized: loop vectorized using 32 byte vectors
(there are a lot more missed vectorizations).  The reduced testcase
      subroutine hsmoc ( emf1, emf2, emf3 )
      integer is, ie, js, je, ks, ke
      common /gridcomi/
     &        is, ie, js, je, ks, ke
      integer in, jn, kn, ijkn
      integer i , j , k
      parameter(in = 128+5
     &        , jn = 128+5
     &        , kn = 128+5)
      parameter(ijkn = 128+5)
      real*8 emf1 ( in, jn, kn), emf2 ( in, jn, kn)
      real*8 vint (ijkn), bint (ijkn)
      do 199 j=js,je+1
        do 59 i=is,ie
          do 858 k=ks,ke+1
            vint(k)= k
            bint(k)= k
 858      continue
          do 58 k=ks,ke+1
            emf1(i,j,k) = vint(k)
            emf2(i,j,k) = bint(k)
 58       continue
 59     continue
 199  continue
      return
      end
doesn't reproduce it, though.  The actual difference for the whole testcase
is, of course, failed data-ref analysis:
Creating dr for (*emf2_1966(D))[_402]
-analyze_innermost: success.
-       base_address: emf2_1966(D)
-       offset from base address: (ssizetype) ((((sizetype) _1928 * 17689 + (sizetype) j_2705 * 133) + (sizetype) i_2672) * 8)
-       constant offset from base address: -142584
-       step: 141512
-       base alignment: 8
+analyze_innermost: hsmoc.f:828:72: missed: failed: evolution of offset is not affine.
+       base_address:
+       offset from base address:
+       constant offset from base address:
+       step:
+       base alignment: 0
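For reference, the constants in the successful analysis line up with the array
extents from the testcase (in = jn = kn = 128+5 = 133); a quick sanity check
(my arithmetic, not part of the dump):

```python
# Sanity-check the constants from the successful analyze_innermost dump,
# assuming extents in = jn = kn = 128 + 5 from the testcase.
n = 128 + 5          # extent of each dimension (133)
elem = 8             # sizeof(real*8) in bytes

k_stride = n * n                    # elements between (i,j,k) and (i,j,k+1)
print(k_stride)                     # 17689, the factor in the offset expression
print(k_stride * elem)              # 141512, the "step" in bytes
print(n)                            # 133, the j factor in the offset expression

# 1-based Fortran indexing biases the linearized index by n*n + n + 1 elements.
bias = n * n + n + 1
print(bias * elem)                  # 142584, magnitude of the constant offset
```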
and then
hsmoc.f:826:19: note: === vect_analyze_data_ref_accesses ===
-hsmoc.f:826:19: missed: not consecutive access (*emf1_1964(D))[_402] = _403;
-hsmoc.f:826:19: note: using strided accesses
-hsmoc.f:826:19: missed: not consecutive access (*emf2_1966(D))[_402] = _404;
-hsmoc.f:826:19: note: using strided accesses
and we use gather and fail because of costs.
I suspect that relying on global ranges (which could save us here) is quite
fragile when there's a lot of other code around, and thus opportunity for
random transforms "trashing" them.
Using the patch from PR114151 and enabling ranger during vectorization oddly
enough doesn't help (even when wiping the SCEV cache).
The odd thing is that with the testcase above we get
Access function 0: (integer(kind=8)) {(((unsigned long) _30 * 17689 + (unsigned long) _10) + (unsigned long) _66) + 18446744073709533793, +, 17689}_4;
where you can see some of the unsigned promotion being done, but we
still succeed.
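The large constant there is just the negative 1-based indexing bias wrapped
to unsigned long; a quick check (assuming 64-bit unsigned arithmetic and the
same 133-element extents as above):

```python
# The big constant in the access function is -(n*n + n + 1) reinterpreted
# as a 64-bit unsigned long (assumption based on the testcase extents).
n = 128 + 5
bias = n * n + n + 1               # 17823 elements for 1-based (i,j,k)
print((-bias) % 2**64)             # 18446744073709533793, as in the dump
print(n * n)                       # 17689, the step of the {base, +, step} CHREC
```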
As I'm lacking a smaller testcase right now, it's difficult to understand why
we fail in one case but not the other.