https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90018
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |ASSIGNED

--- Comment #14 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the important difference when comparing patched/unpatched is that the
unpatched compiler rejected vectorization with

mapz_module.fppized.f90:730:0: note:   dependence distance == 0 between *a4_627(D)[_196] and *a4_627(D)[_196]
mapz_module.fppized.f90:730:0: note:   READ_WRITE dependence in interleaving.
mapz_module.fppized.f90:730:0: note:   bad data dependence.

while the patched compiler is happy.  That points to the patched function
and its call here:

static bool
vect_analyze_data_ref_dependence (struct data_dependence_relation *ddr,
                                  loop_vec_info loop_vinfo,
                                  unsigned int *max_vf)
{
...
      if (dist == 0)
        {
...
          if (!vect_preserves_scalar_order_p (DR_STMT (dra), DR_STMT (drb)))
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "READ_WRITE dependence in interleaving.\n");
              return true;

It's probably a failure to factor in unrolling that breaks this case.

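To illustrate why a distance-0 READ_WRITE dependence matters for interleaving, here is a minimal C sketch. It is not GCC code, and the function names are made up for illustration. It assumes, as a simplification, that a group of loads is performed at the position of its first scalar load, so a later load of the same location may end up above an intervening store.

```c
#include <assert.h>

/* Scalar order: load (a), store (c), load (b), all on the same location.  */
double
scalar_order (double *p, double v)
{
  double first = *p;    /* (a) sees the old value */
  *p = v;               /* (c) */
  double second = *p;   /* (b) must see v */
  return first + second;
}

/* Hypothetical grouped order: (b) is performed together with (a),
   that is, before the store (c).  */
double
grouped_order (double *p, double v)
{
  double first = *p;    /* (a) */
  double second = *p;   /* (b) moved above (c): sees the old value */
  *p = v;               /* (c) */
  return first + second;
}

/* Illustrative self-check: the two orders return different values
   although the final memory contents agree.  */
void
check_orders (void)
{
  double x = 1.0, y = 1.0;
  assert (scalar_order (&x, 5.0) == 6.0);
  assert (grouped_order (&y, 5.0) == 2.0);
  assert (x == y);
}
```

The two functions differ only in where load (b) sits relative to store (c), which is the kind of reordering a scalar-order check has to rule out before accepting the group.
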
The unvectorized loop body looks like (all but relevant loads/stores elided):

  <bb 20> [local count: 118111594]:
  # i_313 = PHI <_1(19), i_293(24)>
  _146 = *a4_255(D)[_145];
  _152 = *a4_255(D)[_151];
  _165 = *a4_255(D)[_164];
  *a4_255(D)[_194] = _195;
  *a4_255(D)[_201] = _202;
  _203 = *a4_255(D)[_145];
  _290 = *a4_255(D)[_151];
  _291 = *a4_255(D)[_194];
  *a4_255(D)[_194] = M.42_316;
  i_293 = i_313 + 1;
  if (_2 < i_293)

final runtime alias checks are:

create runtime check for data references *a4_255(D)[_151] and *a4_255(D)[_201]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_164]
create runtime check for data references *a4_255(D)[_194] and *a4_255(D)[_145]
create runtime check for data references *a4_255(D)[_164] and *a4_255(D)[_201]

and groups are

note:   Detected interleaving load *a4_255(D)[_151] and *a4_255(D)[_194]
note:   Detected interleaving load of size 4 starting with _152 = *a4_255(D)[_151];
note:   There is a gap of 2 elements after the group
note:   Detected single element interleaving *a4_255(D)[_151] step 32
note:   not consecutive access *a4_255(D)[_194] = _195;
note:   using strided accesses
note:   not consecutive access *a4_255(D)[_194] = M.42_316;
note:   using strided accesses
note:   Detected single element interleaving *a4_255(D)[_164] step 32
note:   Detected single element interleaving *a4_255(D)[_145] step 32
note:   Detected single element interleaving *a4_255(D)[_145] step 32
note:   not consecutive access *a4_255(D)[_201] = _202;
note:   using strided accesses

so there's no SLP involved.  The respective loop doesn't involve a reduction
so -ffast-math shouldn't be required here, only -ffinite-math-only for
min/max recognition.

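For readers unfamiliar with the runtime alias checks listed above: roughly, the vectorized loop version is only entered when the memory segments touched by each listed pair of references do not overlap. The following is a simplified sketch of such a test, not what GCC actually emits; `segments_disjoint` and `check_segments` are hypothetical names, and the real checks also take access steps and sizes into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Return true when the byte ranges [a, a + len_a) and [b, b + len_b)
   do not overlap.  Addresses are compared as integers.  */
bool
segments_disjoint (const void *a, size_t len_a,
                   const void *b, size_t len_b)
{
  uintptr_t pa = (uintptr_t) a;
  uintptr_t pb = (uintptr_t) b;
  return pa + len_a <= pb || pb + len_b <= pa;
}

/* Illustrative self-check on a small buffer.  */
void
check_segments (void)
{
  double buf[16];
  /* buf[0..3] and buf[4..7] do not overlap.  */
  assert (segments_disjoint (buf, 4 * sizeof (double),
                             buf + 4, 4 * sizeof (double)));
  /* buf[0..3] and buf[3..6] share buf[3].  */
  assert (!segments_disjoint (buf, 4 * sizeof (double),
                              buf + 3, 4 * sizeof (double)));
}
```

A check of this kind only helps for pairs whose distance is unknown at compile time. Pairs that always touch the same address, such as the distance-0 case discussed earlier, would fail it on every execution and so have to be decided by the static dependence analysis.
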
A C testcase mimicking the memory accesses and failing is

void __attribute__((noinline,noclone))
foo (double *a4, int n)
{
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a4[i*4] + a4[i*4+n];
      double tem2 = a4[i*4+2*n+1];
      a4[i*4+n+1] = tem1;
      a4[i*4+1] = tem2;
      double tem3 = a4[i*4] - a4[i*4+1];
      double tem4 = tem3 + a4[i*4+n];
      a4[i*4+n+1] = tem3 + a4[i*4+n+1];
    }
}

int main()
{
  const int n = 5;
  double a4[4 * n * 8];
  double a42[4 * n * 8];
  for (int i = 0; i < 4 * n * 8; ++i)
    a4[i] = a42[i] = i;
  foo (a4, n);
  /* Scalar reference computation on the copy; the asm barrier keeps
     this loop from being vectorized.  */
  for (int i = 0; i < n; ++i)
    {
      double tem1 = a42[i*4] + a42[i*4+n];
      double tem2 = a42[i*4+2*n+1];
      a42[i*4+n+1] = tem1;
      a42[i*4+1] = tem2;
      double tem3 = a42[i*4] - a42[i*4+1];
      double tem4 = tem3 + a42[i*4+n];
      a42[i*4+n+1] = tem3 + a42[i*4+n+1];
      __asm__ volatile ("": : : "memory");
    }
  for (int i = 0; i < 4 * n * 8; ++i)
    if (a4[i] != a42[i])
      __builtin_abort ();
  return 0;
}