https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181
--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Mar 2025, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181
>
> --- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #7)
> > The issue is we detect this as a single interleaving group:
> >
> > t.c:12:1: note: Detected interleaving load of size 264
> > t.c:12:1: note: _1 = *a_26(D);
> > t.c:12:1: note: _5 = MEM[(double *)a_26(D) + 8B];
> > t.c:12:1: note: _7 = MEM[(double *)a_26(D) + 16B];
> > t.c:12:1: note: _11 = MEM[(double *)a_26(D) + 24B];
> > t.c:12:1: note: _14 = MEM[(double *)a_26(D) + 32B];
> > t.c:12:1: note: _17 = MEM[(double *)a_26(D) + 40B];
> > t.c:12:1: note: _19 = MEM[(double *)a_26(D) + 48B];
> > t.c:12:1: note: _22 = MEM[(double *)a_26(D) + 56B];
> > t.c:12:1: note: <gap of 248 elements>
> > t.c:12:1: note: _2 = MEM[(double *)a_26(D) + 2048B];
> > t.c:12:1: note: _4 = MEM[(double *)a_26(D) + 2056B];
> > t.c:12:1: note: _8 = MEM[(double *)a_26(D) + 2064B];
> > t.c:12:1: note: _10 = MEM[(double *)a_26(D) + 2072B];
> > t.c:12:1: note: _13 = MEM[(double *)a_26(D) + 2080B];
> > t.c:12:1: note: _16 = MEM[(double *)a_26(D) + 2088B];
> > t.c:12:1: note: _20 = MEM[(double *)a_26(D) + 2096B];
> > t.c:12:1: note: _23 = MEM[(double *)a_26(D) + 2104B];
> >
> > so the heuristic to swap operands to get a single group in leafs doesn't
> > work. Instead you get offsetting costs to avoid runaway with very large
> > gaps:
> Thanks for pointing this.
> >
> > *a_26(D) 132 times unaligned_load (misalign -1) costs 1584 in body
> >
> > and that makes it unprofitable.
> >
> > There is indeed some better heuristic needed where to split groups - gaps
> > bigger than the biggest vector size might be a good candidate. Note
> > when two different interleaving groups are used in the same SLP leaf
> > we fail as we don't support that yet.
>
> A simple hack like below works, But I guess we may need better heuristic.

Esp. since you are not supposed to get at a vector type - the dataref
analysis is shared between the iteration through vector types. The
heuristic should probably be based on MAX_BITSIZE_MODE_ANY_MODE. Also,
instead of checking init_b - init_a I'd check init_b - init_prev,
otherwise we risk breaking a contiguous set of DRs when the gap is
placed oddly around MAX_BITSIZE_MODE_ANY_MODE.
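
To make the suggested heuristic concrete, here is a minimal standalone sketch, not the vectorizer's actual code. It assumes the DR offsets are already sorted and given in bytes. The names split_groups, MAX_VECTOR_BITS and BITS_PER_BYTE are hypothetical, and the value 512 only stands in for whatever MAX_BITSIZE_MODE_ANY_MODE is on a given target. A new group starts whenever the distance to the previous DR (init_b - init_prev), rather than to the first DR of the group (init_b - init_a), exceeds the limit.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: stand-in for MAX_BITSIZE_MODE_ANY_MODE; 512 is only an
   illustrative value, the real one is target-dependent.  */
#define MAX_VECTOR_BITS 512
#define BITS_PER_BYTE 8

/* Hypothetical helper: INITS holds N byte offsets sorted ascending.
   Store a group index for each DR in GROUP_OF and return the number
   of groups.  A new group starts when the distance to the previous
   DR exceeds MAX_VECTOR_BITS.  */
static size_t
split_groups (const long *inits, size_t n, int *group_of)
{
  if (n == 0)
    return 0;
  assert (inits != NULL && group_of != NULL);

  size_t ngroups = 1;
  long init_prev = inits[0];
  group_of[0] = 0;

  for (size_t i = 1; i < n; i++)
    {
      long init_b = inits[i];
      /* Compare against the previous DR, not the group start, so a
         long contiguous run is never cut in the middle.  */
      if ((init_b - init_prev) * BITS_PER_BYTE > MAX_VECTOR_BITS)
        ngroups++;
      group_of[i] = (int) (ngroups - 1);
      init_prev = init_b;
    }
  return ngroups;
}
```

Fed the sixteen offsets from the dump above (0B to 56B, then 2048B to 2104B), this sketch yields two groups split at the 248-element gap. Sixteen contiguous doubles stay in one group, whereas a check on init_b - init_a against the same limit would cut that run once it spans more than 512 bits.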