https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> (In reply to Jan Hubicka from comment #14)
> >
> > Counting latencies, I think vinserti64x2 is 1 cycle and vpinst is an
> > integer->sse move that is slower and set to 4 cycles.
> > Overall it is wrong that we use addss cost to estimate vec_construct:
> >
> >       case vec_construct:
> >         {
> >           int n = TYPE_VECTOR_SUBPARTS (vectype);
> >           /* N - 1 element inserts into an SSE vector, the possible
> >              GPR -> XMM move is accounted for in add_stmt_cost.  */
> >           if (GET_MODE_BITSIZE (mode) <= 128)
> >             return (n - 1) * ix86_cost->sse_op;
> >           /* One vinserti128 for combining two SSE vectors for AVX256.  */
> >           else if (GET_MODE_BITSIZE (mode) == 256)
> >             return ((n - 2) * ix86_cost->sse_op
> >                     + ix86_vec_cost (mode, ix86_cost->addss));
> >           /* One vinserti64x4 and two vinserti128 for combining SSE
> >              and AVX256 vectors to AVX512.  */
> >           else if (GET_MODE_BITSIZE (mode) == 512)
> >             return ((n - 4) * ix86_cost->sse_op
> >                     + 3 * ix86_vec_cost (mode, ix86_cost->addss));
> >           gcc_unreachable ();
> >         }
> >
> > I think we may want to have ix86_cost->hard_register->integer_to_sse to
> > cost the construction in integer modes instead of addss?
>
> I have no recollection of why we are mixing sse_op and addss cost here ...
> It's not an integer to SSE conversion either (again the caller adjusts
> for this in this case).  We seem to use sse_op for the element insert
> into an SSE reg and addss for the insert of SSE regs into YMM or ZMM.
>
> I think it's reasonable to change this to consistently use sse_op.

So this was from r8-6815-gbe77ba2a461eef.  addss was chosen for the inserts
into the larger vectors because

"The following tries to account for the fact that when constructing
AVX256 or AVX512 vectors from elements we can only use insertps to
insert into the low 128 bits of a vector but have to use vinserti128
or vinserti64x4 to build larger AVX256/512 vectors.  Those operations
also have higher latency (Agner documents 3 cycles for Broadwell for
reg-reg vinserti128 while insertps has one cycle latency).  Agner
doesn't have tables for AVX512 yet but I guess the story is similar
for vinserti64x4.  Latency is similar for FP adds so I re-used
ix86_cost->addss for this cost."

On zen4 vinserti128 is 1 cycle latency, on zen2 it was still 3.  But OTOH
PINSRB/W/D/Q is 4 cycles on zen4 while insertps is indeed 1 cycle.  What
vec_init generates depends very much on the subarchitecture, so the costing
is likely not accurate with the simple formula used in
ix86_builtin_vectorization_cost.

Re-using the addss cost definitely causes issues now.  Maybe we need
separate "cost of SSE construction from QI/HI/SI,SF/DI,DF", "cost of AVX
construction from SSE" and "cost of AVX512 construction from SSE/AVX"
tuning cost entries.  Or just a single combined cost

  { 60, 28, 3, 1, 1, 3, 1 } /* cost of vector construction from elements
                               V16QI, V8HI, V4SI/V4SF, V2DI/V2DF,
                               YMM from XMM, ZMM from XMM, ZMM from YMM */

though I'd probably split the SSE CTOR costs from the YMM/ZMM costs.  Or we
could have costs for the element inserts into V16QI, V8HI, V4SI/V4SF,
V2DI/V2DF and for V2XMM, V4XMM, V2YMM.

That said, given that with an extra register we build V16QI in a tree-wise
fashion, avoiding 15 times the high pinsrb latency, the combined cost might
be (more?) interesting.
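
To make the split-cost idea concrete, here is a minimal standalone sketch
(not GCC code; the struct and field names sse_insert_elt, ymm_insert_xmm
and zmm_insert_xmm are made up for illustration, and the numbers in main
are example values, not measured latencies) of how vec_construct costing
could look with the XMM element-insert cost kept separate from the
XMM -> YMM/ZMM combine costs:

/* Standalone model of the proposed split vec_construct costing.  The
   field names are hypothetical; GCC's struct processor_costs currently
   only provides sse_op and addss for this purpose.  */

#include <stdio.h>

struct construct_costs
{
  int sse_insert_elt;  /* one element insert into an XMM reg
                          (pinsrb/w/d/q, insertps)  */
  int ymm_insert_xmm;  /* one vinserti128 (XMM -> YMM)  */
  int zmm_insert_xmm;  /* building a ZMM from four XMMs
                          (two vinserti128 + one vinserti64x4)  */
};

/* Cost of constructing a vector of N elements with MODE_BITS total size,
   mirroring the structure of the current formula but with separate
   combine costs instead of re-using addss.  */
static int
vec_construct_cost (const struct construct_costs *c, int n, int mode_bits)
{
  if (mode_bits <= 128)
    /* N - 1 element inserts into an SSE vector.  */
    return (n - 1) * c->sse_insert_elt;
  else if (mode_bits == 256)
    /* Two SSE vectors built from elements, combined with vinserti128.  */
    return (n - 2) * c->sse_insert_elt + c->ymm_insert_xmm;
  else /* mode_bits == 512 */
    /* Four SSE vectors built from elements, combined into a ZMM.  */
    return (n - 4) * c->sse_insert_elt + c->zmm_insert_xmm;
}

int
main (void)
{
  /* Example numbers only: slow element inserts, cheap reg-reg combines.  */
  struct construct_costs c = { 4, 1, 3 };
  printf ("V16QI: %d\n", vec_construct_cost (&c, 16, 128));
  printf ("V8SI:  %d\n", vec_construct_cost (&c, 8, 256));
  printf ("V16SI: %d\n", vec_construct_cost (&c, 16, 512));
  return 0;
}

Per-element-mode insert costs (the V16QI/V8HI/V4SI/V2DI split from the
table above) could then be folded into the element-insert entry without
changing the shape of the formula.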