https://gcc.gnu.org/bugzilla/show_bug.cgi?id=120457

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |ASSIGNED
     Ever confirmed|0                           |1
   Last reconfirmed|                            |2025-05-30
           Assignee|unassigned at gcc dot gnu.org      |rguenth at gcc dot 
gnu.org
   Target Milestone|---                         |16.0
           Keywords|                            |missed-optimization
            Summary|gcc.dg/vect/pr79920.c fail  |[16 Regression]
                   |starting with               |gcc.dg/vect/pr79920.c fail
                   |r16-924-g1bc5b47f5b06dc     |starting with
                   |                            |r16-924-g1bc5b47f5b06dc

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So do we now no longer put t32[ip_1][0] and t32[ip_1][2] into a single DR
group?
IIRC power has V2DF but nothing larger, so those two elements will never get
loaded together, meaning the heuristic makes some sense.

t.c:14:7: note:   Detected single element interleaving *_4 step 24
t.c:14:7: note:   Detected single element interleaving *_10 step 24
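
For reference, a hypothetical loop with the same shape of accesses (not the
actual pr79920.c source; the [3] inner dimension is only inferred from the
24-byte step, and the reduction is just for illustration):

  /* Elements 0 and 2 of a row are 16 bytes apart, so with only V2DF
     available they can never be covered by a single vector load, and
     each access advances by a 24-byte (3-double) step per iteration.  */
  double t32[1024][3];

  double
  foo (int n)
  {
    double sum = 0.0;
    for (int ip = 0; ip < n; ip++)
      sum += t32[ip][0] + t32[ip][2];
    return sum;
  }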

But what happens is that we end up lowering this into a vector interleaving
scheme that covers the other access anyway:

t.c:14:7: note:   node 0x3dda5b0 (max_nunits=2, refcnt=2) vector(2) double
t.c:14:7: note:   op: VEC_PERM_EXPR
t.c:14:7: note:         stmt 0 _5 = *_4;
t.c:14:7: note:         lane permutation { 0[0] }
t.c:14:7: note:         children 0x3dda910
t.c:14:7: note:   node 0x3dda910 (max_nunits=1, refcnt=1) vector(2) double
t.c:14:7: note:   op: VEC_PERM_EXPR
t.c:14:7: note:         stmt 0 _5 = *_4;
t.c:14:7: note:         stmt 1 _5 = *_4;
t.c:14:7: note:         lane permutation { 0[0] 0[0] }
t.c:14:7: note:         children 0x3dda880
t.c:14:7: note:   node 0x3dda880 (max_nunits=2, refcnt=2) vector(2) double
t.c:14:7: note:   op template: _5 = *_4;
t.c:14:7: note:         stmt 0 _5 = *_4;
t.c:14:7: note:         stmt 1 ---
t.c:14:7: note:         stmt 2 ---

but then we decide:

t.c:14:7: note:   === vect_slp_analyze_operations ===
t.c:14:7: note:   ==> examining statement: _5 = *_4;
t.c:14:7: missed:   single-element interleaving not supported for not adjacent
vector loads

That would get us elementwise accesses if we had just a single lane, but the
lowering above wrecked that path:

t.c:15:31: missed:   not vectorized: relevant stmt not supported: _5 = *_4;
t.c:14:7: note:   unsupported SLP instance starting from: t33[ip_1_46][i_0_47]
= _14;
t.c:14:7: missed:  unsupported SLP instances
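
For comparison, the elementwise fallback mentioned above loads each vector
lane with its own scalar load and builds the vector from the pieces, instead
of doing a wide load whose extra lanes go unused.  A hand-written sketch
using the GNU vector extension (illustration only, not vectorizer output):

  typedef double v2df __attribute__ ((vector_size (16)));

  /* Single-element interleaving load with a 24-byte (3-double) step:
     one scalar load per lane, lanes taken from consecutive iterations,
     and no vector load touching the unused elements in between.  */
  v2df
  load_elementwise (const double *p)
  {
    v2df v = { p[0], p[3] };
    return v;
  }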

This is a heuristic as well:

          /* If this is single-element interleaving with an element
             distance that leaves unused vector loads around fall back
             to elementwise access if possible - we otherwise least
             create very sub-optimal code in that case (and
             blow up memory, see PR65518).  */
          if (loop_vinfo
              && single_element_p
              && (*memory_access_type == VMAT_CONTIGUOUS
                  || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
              && maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype)))

We can extend that to be a permute-lowering heuristic as well.
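
One possible reading, sketched against the snippet just quoted (the
load_lanes_only_used_via_permutes_p helper is hypothetical and just names
the extra condition, the use of slp_node and the placement are guesses,
this is not a tested patch):

          /* As above, but also fall back to elementwise access when SLP
             permute lowering turned the single-element group into a
             contiguous load whose extra lanes are only consumed by
             permutes.  */
          if (loop_vinfo
              && single_element_p
              && (*memory_access_type == VMAT_CONTIGUOUS
                  || *memory_access_type == VMAT_CONTIGUOUS_REVERSE)
              && (maybe_gt (group_size, TYPE_VECTOR_SUBPARTS (vectype))
                  || load_lanes_only_used_via_permutes_p (slp_node)))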
