[Bug target/96789] x264: sub4x4_dct() improves when vectorization is disabled

rguenth at gcc dot gnu.org via Gcc-bugs Sun, 27 Sep 2020 23:48:42 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=96789


--- Comment #30 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #23)
> >  _813 = {_437, _448, _459, _470, _490, _501, _512, _523, _543, _554, _565,
> > _576, _125, _143, _161, _179}; 
> 
> The cost of vec_construct in i386 backend is 64, calculated as 16 x 4
> 
> cut from i386.c
> ---
> /* N element inserts into SSE vectors.  */ 
> int cost = TYPE_VECTOR_SUBPARTS (vectype) * ix86_cost->sse_op;
> ---
> 
> From the perspective of pipeline latency, it seems ok, but from the
> perspective of rtx_cost, it seems inaccurate since it would be initialized as
> ---
>         vmovd   %eax, %xmm0
>         vpinsrb $1, 1(%rsi), %xmm0, %xmm0
>         vmovd   %eax, %xmm7
>         vpinsrb $1, 3(%rsi), %xmm7, %xmm7
>         vmovd   %eax, %xmm3
>         vpinsrb $1, 17(%rsi), %xmm3, %xmm3
>         vmovd   %eax, %xmm6
>         vpinsrb $1, 19(%rsi), %xmm6, %xmm6
>         vmovd   %eax, %xmm1
>         vpinsrb $1, 33(%rsi), %xmm1, %xmm1
>         vmovd   %eax, %xmm5
>         vpinsrb $1, 35(%rsi), %xmm5, %xmm5
>         vmovd   %eax, %xmm2
>         vpinsrb $1, 49(%rsi), %xmm2, %xmm2
>         vmovd   %eax, %xmm4
>         vpinsrb $1, 51(%rsi), %xmm4, %xmm4
>         vpunpcklwd      %xmm6, %xmm3, %xmm3
>         vpunpcklwd      %xmm4, %xmm2, %xmm2
>         vpunpcklwd      %xmm7, %xmm0, %xmm0
>         vpunpcklwd      %xmm5, %xmm1, %xmm1
>         vpunpckldq      %xmm2, %xmm1, %xmm1
>         vpunpckldq      %xmm3, %xmm0, %xmm0
>         vpunpcklqdq     %xmm1, %xmm0, %xmm0
> ---
> 
> it's 16 "vector insert" + (4 + 2 + 1) "vector concat/permutation", so cost
> should be 92 (23 * 4).

So the important part for any target is that it makes the scalar and
vector costs apples to apples because they end up being compared
against each other.  For loops the most important metric tends to be
latency which is also the only thing that can be reasonably costed
when looking at a single statement at a time.  For all other factors
coming in there's (in theory) the finish_cost hook where, after
gathering individual stmt data from add_stmt_cost, a target hook can
apply adjustments based on say functional unit allocation (IIRC
the powerpc backend looks whether there are "many" shifts and
disparages vectorization in that case).

For the vector construction the x86 backend does a reasonable job
in costing - the only thing that's not very well modeled is the
extra cost of constructing from values in GPRs compared to
values in XMM regs (on some CPU archs that even has extra penalties).
But as seen above "GPR" values can also come from memory where
the difference vanishes (for AVX, not for SSE).
