https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115833
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed     |Added
----------------------------------------------------------------------------
             Target|            |x86_64-*-*
   Last reconfirmed|            |2024-07-09
     Ever confirmed|0           |1
             Status|UNCONFIRMED |NEW
                 CC|            |crazylht at gmail dot com,
                   |            |rguenth at gcc dot gnu.org

--- Comment #1 from Richard Biener <rguenth at gcc dot gnu.org> ---
It's because we discover

t.c:3:8: note: node 0x6624850 (max_nunits=4, refcnt=2) vector(4) unsigned short
t.c:3:8: note: op template: _4 = _2 * tt.0_3;
t.c:3:8: note:   stmt 0 _4 = _2 * tt.0_3;
t.c:3:8: note:   stmt 1 _8 = tt.0_3 * _7;
t.c:3:8: note:   stmt 2 _12 = tt.0_3 * _11;
t.c:3:8: note:   stmt 3 _16 = tt.0_3 * _15;
t.c:3:8: note:   children 0x66248e0 0x6624a00
t.c:3:8: note: node 0x66248e0 (max_nunits=4, refcnt=2) vector(4) unsigned short

and the comment already says

          if (is_a <bb_vec_info> (vinfo)
              && !oprnd_info->any_pattern)
            {
              /* Now for commutative ops we should see whether we can make
                 the other operand matching.  */
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "treating operand as external\n");
              oprnd_info->first_dt = dt = vect_external_def;
            }
          else
            {
              if (dump_enabled_p ())
                dump_printf_loc (MSG_MISSED_OPTIMIZATION, vect_location,
                                 "Build SLP failed: different types\n");
              return 1;
            }

Going the loop-SLP way in this case fixes it, but we should probably pass down a
hint whether the parent operation could swap its operands (though we don't know
whether that swap would then succeed).  This then produces:

v4hi_smul:
.LFB0:
        .cfi_startproc
        movq    (%rdi), %xmm0
        movd    %esi, %xmm2
        pshuflw $0, %xmm2, %xmm1
        pmullw  %xmm1, %xmm0
        movq    %xmm0, (%rdi)
        ret

Note this is all heuristics and shows the difficulty of a greedy search
(without exploring the full search space).
Note costing even favors the "wrong" vectorization, but only slightly:

t.c:3:8: note: Cost model analysis:
_5 1 times scalar_store costs 12 in body
_9 1 times scalar_store costs 12 in body
_13 1 times scalar_store costs 12 in body
_17 1 times scalar_store costs 12 in body
_2 * tt.0_3 1 times scalar_stmt costs 16 in body
tt.0_3 * _7 1 times scalar_stmt costs 16 in body
tt.0_3 * _11 1 times scalar_stmt costs 16 in body
tt.0_3 * _15 1 times scalar_stmt costs 16 in body
node 0x59e9970 1 times vec_construct costs 18 in prologue
node 0x59e9a90 1 times vec_construct costs 18 in prologue
_2 * tt.0_3 1 times vector_stmt costs 16 in body
_5 1 times unaligned_store (misalign -1) costs 12 in body
t.c:3:8: note: Cost model analysis for part in loop 0:
  Vector cost: 64
  Scalar cost: 112
t.c:3:8: note: Basic block will be vectorized using SLP
t.c:3:8: optimized: basic block part vectorized using 8 byte vectors

It seems the very bad code generation mostly stems from constructing the
V4HImode vectors by going through GPRs with shifts and ORs.  Possibly
constructing a V4SImode vector and then packing it down to V4HImode would be
better?