https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91934
--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the difference between good and bad is the data-ref access analysis, which
figures single-element interleaving in GCC 8 and nicer interleaving in GCC 9,
where I rewrote parts of that analysis:

t.c:15:9: note: === vect_analyze_data_ref_accesses ===
t.c:15:9: note: Detected interleaving load _6->i and _6->q
t.c:15:9: note: Detected interleaving load _8->i and _8->q
t.c:15:9: note: Detected interleaving load _34->i and _34->q
t.c:15:9: note: Detected interleaving load _32->i and _32->q
t.c:15:9: note: Detected interleaving load _3->i and _37->i
t.c:15:9: note: Queuing group with duplicate access for fixup
t.c:15:9: note: Detected interleaving load _3->i and _3->q
t.c:15:9: note: Detected interleaving load _3->i and _37->q
t.c:15:9: note: Detected interleaving store _3->i and _37->i
t.c:15:9: note: Queuing group with duplicate access for fixup
t.c:15:9: note: Detected interleaving store _3->i and _3->q
t.c:15:9: note: Detected interleaving store _3->i and _37->q

See the 'Queuing group with duplicate access' lines, which come from a new
feature that deals a bit better with interleaving exposed by unrolling. In
particular we have redundancies that the old code simply gives up on:

  <bb 3> [local count: 66409497]:
  # j_40 = PHI <0(5), j_75(21)>
  # ivtmp_28 = PHI <200(5), ivtmp_44(21)>
  idx_22 = _1 + j_40;
  _2 = j_40 * 8;
  _3 = dst_23(D) + _2;
  _4 = _3->i;
  ...
  _38 = j_40 * 8;
  _37 = dst_23(D) + _38;
  _36 = _37->i;

while the new code simply leaves them in place, vectorizing them.

So for GCC 9 the fix for PR87105 (specifically r265457) fixed this.
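For readers without the testcase at hand, the redundancy described above can be pictured with a small C sketch. This is not the t.c from the report, which is not quoted in this comment. The struct name, the `int` element type, and the function `accumulate` are assumptions chosen only so that the record is 8 bytes wide and has fields `i` and `q`, as in the dump. The two pointers `a` and `b` stand in for `_3` and `_37`: both are computed as `dst + j*8` and so address the same record.

```c
/* Hypothetical 8-byte I/Q record. The field names i and q mirror the
   dump; the element type is an assumption. */
struct iq { int i; int q; };

/* Each iteration reaches dst[j] through two separately computed
   pointers, mimicking the duplicate address computations (_3 and _37)
   that unrolling left behind in the GIMPLE shown above. */
static void accumulate(struct iq *dst, const struct iq *src, int n)
{
    for (int j = 0; j < n; j++) {
        struct iq *a = dst + j;   /* like _3  = dst + j * 8 */
        a->i += src[j].i;
        a->q += src[j].q;

        struct iq *b = dst + j;   /* like _37 = dst + j * 8, same address */
        b->i += src[j].i;
        b->q += src[j].q;
    }
}
```

Whether this exact source produces the 'Queuing group with duplicate access' note is not guaranteed, since earlier passes may merge the two address computations before the vectorizer runs. Compiling with `-O3 -fdump-tree-vect-details` is one way to check what the access analysis sees on a given GCC version.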