[Bug tree-optimization/119181] Missed vectorization due to imperfect SLP discovery for 2 grouped load with same base pointer (taken as 1 interleaved load)

rguenther at suse dot de via Gcc-bugs Tue, 11 Mar 2025 02:24:50 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181


--- Comment #9 from rguenther at suse dot de <rguenther at suse dot de> ---
On Tue, 11 Mar 2025, liuhongt at gcc dot gnu.org wrote:

> https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181
> 
> --- Comment #8 from Hongtao Liu <liuhongt at gcc dot gnu.org> ---
> (In reply to Richard Biener from comment #7)
> > The issue is we detect this as a single interleaving group:
> > 
> > t.c:12:1: note:   Detected interleaving load of size 264
> > t.c:12:1: note:         _1 = *a_26(D);
> > t.c:12:1: note:         _5 = MEM[(double *)a_26(D) + 8B]; 
> > t.c:12:1: note:         _7 = MEM[(double *)a_26(D) + 16B];
> > t.c:12:1: note:         _11 = MEM[(double *)a_26(D) + 24B];
> > t.c:12:1: note:         _14 = MEM[(double *)a_26(D) + 32B];
> > t.c:12:1: note:         _17 = MEM[(double *)a_26(D) + 40B];
> > t.c:12:1: note:         _19 = MEM[(double *)a_26(D) + 48B];
> > t.c:12:1: note:         _22 = MEM[(double *)a_26(D) + 56B];
> > t.c:12:1: note:         <gap of 248 elements>
> > t.c:12:1: note:         _2 = MEM[(double *)a_26(D) + 2048B];
> > t.c:12:1: note:         _4 = MEM[(double *)a_26(D) + 2056B];
> > t.c:12:1: note:         _8 = MEM[(double *)a_26(D) + 2064B];
> > t.c:12:1: note:         _10 = MEM[(double *)a_26(D) + 2072B];
> > t.c:12:1: note:         _13 = MEM[(double *)a_26(D) + 2080B];
> > t.c:12:1: note:         _16 = MEM[(double *)a_26(D) + 2088B];
> > t.c:12:1: note:         _20 = MEM[(double *)a_26(D) + 2096B];
> > t.c:12:1: note:         _23 = MEM[(double *)a_26(D) + 2104B];
> > 
> > so the heuristic to swap operands to get a single group in leafs doesn't
> > work.  Instead you get offsetting costs to avoid runaway with very large
> > gaps:
> Thanks for pointing this out.
> > 
> > *a_26(D) 132 times unaligned_load (misalign -1) costs 1584 in body
> > 
> > and that makes it unprofitable.
> > 
> > There is indeed some better heuristic needed where to split groups - gaps
> > bigger than the biggest vector size might be a good candidate.  Note
> > when two different interleaving groups are used in the same SLP leaf
> > we fail as we don't support that yet.
> 
> A simple hack like below works, but I guess we may need a better heuristic.

Especially since you are not supposed to get at a vector type - the
dataref analysis is shared between the iterations through vector types.
The heuristic should probably be based on MAX_BITSIZE_MODE_ANY_MODE.
Also, instead of checking init_b - init_a I'd check init_b - init_prev;
otherwise we risk breaking a contiguous set of DRs when the gap is
placed oddly around MAX_BITSIZE_MODE_ANY_MODE.
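
To illustrate the init_b - init_prev versus init_b - init_a point, here is a
minimal standalone sketch. It is not GCC code: the function names, the
plain vector of byte offsets standing in for sorted DR_INIT values, and
the 64-byte limit (an assumed stand-in for MAX_BITSIZE_MODE_ANY_MODE
expressed in bytes) are all hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed split threshold in bytes; a hypothetical stand-in for
// MAX_BITSIZE_MODE_ANY_MODE / BITS_PER_UNIT on a 512-bit target.
constexpr int64_t kMaxGapBytes = 64;

using Groups = std::vector<std::vector<int64_t>>;

// Split sorted byte offsets into groups, starting a new group when the
// distance to the PREVIOUS access (init_b - init_prev) exceeds the limit.
Groups split_by_prev(const std::vector<int64_t>& offs, int64_t limit) {
  assert(limit > 0);
  Groups groups;
  for (int64_t off : offs) {
    if (groups.empty() || off - groups.back().back() > limit)
      groups.emplace_back();
    groups.back().push_back(off);
  }
  return groups;
}

// Same, but measuring from the FIRST access of the current group
// (init_b - init_a). A long contiguous run gets cut once it grows
// past the limit, even though it contains no gap at all.
Groups split_by_first(const std::vector<int64_t>& offs, int64_t limit) {
  assert(limit > 0);
  Groups groups;
  for (int64_t off : offs) {
    if (groups.empty() || off - groups.back().front() > limit)
      groups.emplace_back();
    groups.back().push_back(off);
  }
  return groups;
}
```

With the offsets from the dump above (0..56 and 2048..2104 in steps of 8),
both variants yield two groups of eight loads. They differ on a contiguous
run of sixteen doubles (offsets 0..120): the init_prev variant keeps it as
one group, while the init_a variant splits it in two.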
