https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79245
--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
Note the trivial fix will FAIL gcc.dg/tree-ssa/ldist-23.c, which looks like

  int i;
  for (i = 0; i < 128; ++i)
    {
      a[i] = a[i] + 1;
      b[i] = d[i];
      c[i] = a[i] / d[i];
    }

where the testcase expects b[i] = d[i] to be split out as a memcpy but the
other two partitions to be fused.

Generally the cost model lacks a way to compute the number of input/output
streams of a partition, and a target interface to query limits on them.
Usually store bandwidth is not equal to load bandwidth, and store streams
that are not re-used can benefit from the non-temporal stores used by libc.

In your testcase I wonder whether distributing to

  for (int j = 0; j < x; j++)
    {
      for (int i = 0; i < y; i++)
        {
          c[j][i] = b[j][i] - a[j][i];
        }
    }
  memcpy (a, b, ...);

would be faster in the end (or even doing the memcpy first in this case).
Well, for now let's be more conservative given the cost model really is
lacking.