https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Target|                            |x86_64-*-*
   Last reconfirmed|                            |2024-07-09
     Ever confirmed|0                           |1
             Status|UNCONFIRMED                 |NEW
                 CC|                            |crazylht at gmail dot com,
                   |                            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's because we discover

t.c:3:8: note:   node 0x6624850 (max_nunits=4, refcnt=2) vector(4) unsigned short
t.c:3:8: note:   op template: _4 = _2 * tt.0_3;
t.c:3:8: note:          stmt 0 _4 = _2 * tt.0_3;
t.c:3:8: note:          stmt 1 _8 = tt.0_3 * _7;
t.c:3:8: note:          stmt 2 _12 = tt.0_3 * _11;
t.c:3:8: note:          stmt 3 _16 = tt.0_3 * _15;
t.c:3:8: note:          children 0x66248e0 0x6624a00
t.c:3:8: note:   node 0x66248e0 (max_nunits=4, refcnt=2) vector(4) unsigned short

and the comment already says

          if (is_a <bb_vec_info> (vinfo)
              && !oprnd_info->any_pattern)
            {
              /* Now for commutative ops we should see whether we can
                 make the other operand matching.  */
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "treating operand as external\n");
              oprnd_info->first_dt = dt = vect_external_def;
            }
          else
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "Build SLP failed: different types\n");
              return 1;
            }
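
What "make the other operand matching" could mean can be sketched as a toy (an illustration only, not GCC's actual SLP code): lane 0 fixes which definition sits in which operand slot, and later lanes may swap their commutative operands to match that layout instead of bailing out to an external def.

```c
#include <stdbool.h>
#include <string.h>

/* Toy model: each lane's multiply has two operand definitions,
   identified here just by name.  */
typedef struct { const char *op0, *op1; } lane;

/* Returns true if, after optional per-lane swaps of the commutative
   operands, every lane's op1 matches lane 0's op1 (the splat operand
   in the testcase above).  */
static bool
match_commutative (lane *l, int n)
{
  const char *ref = l[0].op1;        /* lane 0 fixes the layout */
  for (int i = 1; i < n; i++)
    {
      if (strcmp (l[i].op1, ref) == 0)
        continue;                     /* already matches */
      if (strcmp (l[i].op0, ref) == 0)
        {                             /* a swap makes it match */
          const char *t = l[i].op0;
          l[i].op0 = l[i].op1;
          l[i].op1 = t;
          continue;
        }
      return false;                   /* build would fail here */
    }
  return true;
}
```

Applied to the lanes from the dump ({_2, tt}, {tt, _7}, {tt, _11}, {tt, _15}) this swaps lanes 1-3 so operand 1 is the uniform tt in every lane.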

Going the loop-SLP way fixes it in this case, but we should probably pass down
a hint as to whether the parent operation could swap its operands (though we
would not know in advance whether that swap would then succeed).

This then produces

v4hi_smul:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %xmm0
        movd    %esi, %xmm2
        pshuflw $0, %xmm2, %xmm1
        pmullw  %xmm1, %xmm0
        movq    %xmm0, (%rdi)
        ret
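
For context, a reduced input consistent with the dump and the assembly above might look like the following (a hypothetical reconstruction, not the PR's actual t.c; the mixed per-lane operand order mirrors the SLP dump):

```c
/* Hypothetical testcase: four unsigned-short multiplies by the same
   scalar, stored back, with the operand order varying per lane.  */
void
v4hi_smul (unsigned short *x, unsigned short t)
{
  x[0] = x[0] * t;
  x[1] = t * x[1];
  x[2] = t * x[2];
  x[3] = t * x[3];
}
```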

Note this is all heuristics and shows the difficulty of a greedy search
(without exploring the full search space).

Note costing even favors the "wrong" vectorization, but only slightly:

t.c:3:8: note: Cost model analysis:
_5 1 times scalar_store costs 12 in body
_9 1 times scalar_store costs 12 in body
_13 1 times scalar_store costs 12 in body
_17 1 times scalar_store costs 12 in body
_2 * tt.0_3 1 times scalar_stmt costs 16 in body
tt.0_3 * _7 1 times scalar_stmt costs 16 in body
tt.0_3 * _11 1 times scalar_stmt costs 16 in body
tt.0_3 * _15 1 times scalar_stmt costs 16 in body
node 0x59e9970 1 times vec_construct costs 18 in prologue
node 0x59e9a90 1 times vec_construct costs 18 in prologue
_2 * tt.0_3 1 times vector_stmt costs 16 in body
_5 1 times unaligned_store (misalign -1) costs 12 in body
t.c:3:8: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 112
t.c:3:8: note: Basic block will be vectorized using SLP
t.c:3:8: optimized: basic block part vectorized using 8 byte vectors

It seems the very bad code generation mostly comes from constructing the
V4HImode vectors via GPRs with shifts and ORs.  Possibly constructing a
V4SImode vector and then packing to V4HImode would be better?
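
A sketch of that alternative with SSE2 intrinsics (an illustration of the idea, not proposed codegen; it assumes the 16-bit values stay below 0x8000 since packssdw saturates as signed):

```c
#include <emmintrin.h>

/* Build a V4HI in the low half of an XMM register by inserting the four
   16-bit lanes as 32-bit V4SI elements and packing down, instead of
   assembling the vector in GPRs with shifts and ORs.  */
static __m128i
build_v4hi (unsigned short a, unsigned short b,
            unsigned short c, unsigned short d)
{
  __m128i lo = _mm_unpacklo_epi32 (_mm_cvtsi32_si128 (a),
                                   _mm_cvtsi32_si128 (b));   /* {a,b,0,0} */
  __m128i hi = _mm_unpacklo_epi32 (_mm_cvtsi32_si128 (c),
                                   _mm_cvtsi32_si128 (d));   /* {c,d,0,0} */
  __m128i v4si = _mm_unpacklo_epi64 (lo, hi);                /* {a,b,c,d} */
  /* Signed-saturating pack V4SI -> V8HI; the V4HI we want is the
     low half.  */
  return _mm_packs_epi32 (v4si, v4si);
}
```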
