https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119298
--- Comment #18 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #17)
> (In reply to Jan Hubicka from comment #14)
> >
> > Counting latencies, I think vinserti64x2 is 1 cycle and vpinst is an
> > integer->sse move that is slower and set to 4 cycles.
> > Overall it is wrong that we use addss cost to estimate vec_construct:
> >
> >       case vec_construct:
> >         {
> >           int n = TYPE_VECTOR_SUBPARTS (vectype);
> >           /* N - 1 element inserts into an SSE vector, the possible
> >              GPR -> XMM move is accounted for in add_stmt_cost.  */
> >           if (GET_MODE_BITSIZE (mode) <= 128)
> >             return (n - 1) * ix86_cost->sse_op;
> >           /* One vinserti128 for combining two SSE vectors for AVX256.  */
> >           else if (GET_MODE_BITSIZE (mode) == 256)
> >             return ((n - 2) * ix86_cost->sse_op
> >                     + ix86_vec_cost (mode, ix86_cost->addss));
> >           /* One vinserti64x4 and two vinserti128 for combining SSE
> >              and AVX256 vectors to AVX512.  */
> >           else if (GET_MODE_BITSIZE (mode) == 512)
> >             return ((n - 4) * ix86_cost->sse_op
> >                     + 3 * ix86_vec_cost (mode, ix86_cost->addss));
> >           gcc_unreachable ();
> >         }
> >
> > I think we may want to have ix86_cost->hard_register->integer_to_sse to
> > cost the construction in integer modes instead of addss?
>
> I have no recollection of why we are mixing sse_op and addss cost here ...
> It's not an integer to SSE conversion either (again the caller adjusts
> for this in this case).  We seem to use sse_op for the element insert
> into an SSE reg and addss for the insert of SSE regs into YMM or ZMM.
>
> I think it's reasonable to change this to consistently use sse_op.

So this was from r8-6815-gbe77ba2a461eef.  addss was chosen for the inserts
into the larger vectors because

"The following tries to account for the fact that when constructing
AVX256 or AVX512 vectors from elements we can only use insertps to
insert into the low 128 bits of a vector but have to use vinserti128
or vinserti64x4 to build larger AVX256/512 vectors.  Those operations
also have higher latency (Agner documents 3 cycles for Broadwell for
reg-reg vinserti128 while insertps has one cycle latency).  Agner
doesn't have tables for AVX512 yet but I guess the story is similar
for vinserti64x4.  Latency is similar for FP adds so I re-used
ix86_cost->addss for this cost."

On zen4 vinserti128 is 1 cycle latency, on zen2 it was still 3.  But OTOH
PINSRB/W/D/Q is 4 cycles on zen4 while insertps is indeed 1 cycle.  What
vec_init generates depends very much on the subarchitecture, so the costing
is likely not accurate with the simple formula used in
ix86_builtin_vectorization_cost.

Re-using the addss cost definitely causes issues now.  Maybe we need
separate "cost of SSE construction from QI/HI/SI,SF/DI,DF", "cost of AVX
construction from SSE" and "cost of AVX512 construction from SSE/AVX"
tuning cost entries.  Or just a single combined cost

  { 60, 28, 3, 1, 1, 3, 1 } /* cost of vector construction from elements
                               V16QI, V8HI, V4SI/V4SF, V2DI/V2DF,
                               YMM from XMM, ZMM from XMM, ZMM from YMM */

though I'd probably split the SSE CTOR costs from the YMM/ZMM costs.  Or we
could have costs for the element inserts into V16QI, V8HI, V4SI/V4SF,
V2DI/V2DF and for V2XMM, V4XMM, V2YMM.

That said, given that with an extra register we build V16QI in a tree-wise
fashion, avoiding 15 times the high pinsrb latency, the combined cost might
be (more?) interesting.
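
To make the split-cost idea concrete, here is a minimal standalone sketch
(not GCC code; the struct and field names sse_insert_elt, ymm_insert_xmm
and zmm_insert_xmm are made up for illustration, and the numbers in main
are example values, not measured latencies) of how vec_construct costing
could look with the XMM element-insert cost kept separate from the
XMM -> YMM/ZMM combine costs:

/* Standalone model of the proposed split vec_construct costing.  The
   field names are hypothetical; GCC's struct processor_costs currently
   only provides sse_op and addss for this purpose.  */

#include <stdio.h>

struct construct_costs
{
  int sse_insert_elt;  /* one element insert into an XMM reg
                          (pinsrb/w/d/q, insertps)  */
  int ymm_insert_xmm;  /* one vinserti128 (XMM -> YMM)  */
  int zmm_insert_xmm;  /* building a ZMM from four XMMs
                          (two vinserti128 + one vinserti64x4)  */
};

/* Cost of constructing a vector of N elements with MODE_BITS total size,
   mirroring the structure of the current formula but with separate
   combine costs instead of re-using addss.  */
static int
vec_construct_cost (const struct construct_costs *c, int n, int mode_bits)
{
  if (mode_bits <= 128)
    /* N - 1 element inserts into an SSE vector.  */
    return (n - 1) * c->sse_insert_elt;
  else if (mode_bits == 256)
    /* Two SSE vectors built from elements, combined with vinserti128.  */
    return (n - 2) * c->sse_insert_elt + c->ymm_insert_xmm;
  else /* mode_bits == 512 */
    /* Four SSE vectors built from elements, combined into a ZMM.  */
    return (n - 4) * c->sse_insert_elt + c->zmm_insert_xmm;
}

int
main (void)
{
  /* Example numbers only: slow element inserts, cheap reg-reg combines.  */
  struct construct_costs c = { 4, 1, 3 };
  printf ("V16QI: %d\n", vec_construct_cost (&c, 16, 128));
  printf ("V8SI:  %d\n", vec_construct_cost (&c, 8, 256));
  printf ("V16SI: %d\n", vec_construct_cost (&c, 16, 512));
  return 0;
}

Per-element-mode insert costs (the V16QI/V8HI/V4SI/V2DI split from the
table above) could then be folded into the element-insert entry without
changing the shape of the formula.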