[Bug target/83203] [8 Regression] Inefficient int to avx2 vector conversion

jakub at gcc dot gnu.org Thu, 30 Nov 2017 06:09:14 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=83203


Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
                 CC|                            |hubicka at gcc dot gnu.org,
                   |                            |uros at gcc dot gnu.org

--- Comment #3 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
So, the above-mentioned change optimizes during cse1:
(insn 8 7 9 2 (set (reg:V2DI 91)
        (vec_merge:V2DI (vec_duplicate:V2DI (reg/v:DI 88 [ x ]))
            (reg:V2DI 91)
            (const_int 1 [0x1]))) "pr83203.c":6 3655 {sse4_1_pinsrq}
     (expr_list:REG_DEAD (reg/v:DI 88 [ x ])
        (nil)))
to:
(insn 8 7 9 2 (set (reg:V2DI 91)
        (vec_concat:V2DI (reg/v:DI 88 [ x ])
            (const_int 0 [0]))) "pr83203.c":6 3738 {vec_concatv2di}
     (expr_list:REG_DEAD (reg/v:DI 88 [ x ])
        (nil)))
as pseudo 91 contains all zeros.
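
To make the equivalence concrete, here is a small portable C model of the two RTL forms. It is only a sketch: the struct and helper names are invented for illustration and are not GCC internals. It assumes the usual RTL semantics, where lane i of a vec_merge comes from the first operand when bit i of the mask is set and from the second operand otherwise.

```c
#include <stdint.h>

/* Two-lane model of a V2DI register; lane[0] is the low element.
   Hypothetical type, for illustration only.  */
typedef struct { int64_t lane[2]; } v2di;

/* Models (vec_duplicate:V2DI x): broadcast x into both lanes.  */
v2di vec_duplicate_v2di (int64_t x)
{
  v2di r = { { x, x } };
  return r;
}

/* Models (vec_merge:V2DI a b mask): lane i is taken from a when
   bit i of mask is set, otherwise from b.  */
v2di vec_merge_v2di (v2di a, v2di b, unsigned mask)
{
  v2di r;
  for (int i = 0; i < 2; i++)
    r.lane[i] = ((mask >> i) & 1) ? a.lane[i] : b.lane[i];
  return r;
}

/* Models (vec_concat:V2DI lo hi).  */
v2di vec_concat_v2di (int64_t lo, int64_t hi)
{
  v2di r = { { lo, hi } };
  return r;
}

/* Returns 1 when the pinsrq form (merge into an all-zero register
   with mask 1) and the vec_concat form produce the same vector.  */
int forms_agree (int64_t x)
{
  v2di zero = { { 0, 0 } };
  v2di before = vec_merge_v2di (vec_duplicate_v2di (x), zero, 1);
  v2di after = vec_concat_v2di (x, 0);
  return before.lane[0] == after.lane[0]
	 && before.lane[1] == after.lane[1];
}
```

The two forms only agree because the old contents of the destination are known to be zero; with any other value in pseudo 91 the upper lane would differ.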
Now, because this is generic tuning, we force that onto the stack.
Though I must repeat for the nth time that this is very confusing.  There are
two possibilities:
1) For some AMD chips (is it really that bad in contemporary ones?) vmovd is
way too expensive and vpinsrq is also too expensive.  In that case we should be
happy with what we emit now on the trunk, but then
<sse2p4_1>_pinsr<ssemodesuffix> should use Yi instead of x or v in alternatives
with r input, and vec_concatv2di should similarly use Yi in the vpinsrq and
pinsrq alternatives.
2) vmovd is expensive, but vpinsrq is not.  Then we should just use vpinsrq for
the vec_concatv2di pattern, i.e. add an alternative for =x,r,C which will split
into clearing the destination plus vpinsrq.

Another thing is that with -O2 -mavx2 -mtune=intel we emit:
        vmovq   %rdi, %xmm0
        vmovdqa %xmm0, %xmm0
        ret
when we could just emit
        vmovq   %rdi, %xmm0
I think.  I guess we'd need a pattern for combine that would match what
combiner's trying:
(set (reg:V4DI 90)
    (vec_concat:V4DI (vec_concat:V2DI (reg/v:DI 88 [ x ])
            (const_int 0 [0]))
        (const_vector:V2DI [
                (const_int 0 [0])
                (const_int 0 [0])
            ])))
and perhaps simplify that into something different, e.g. a vec_select from all
zeros and vec_duplicate, so that we don't need to list all the weird cases?
Though perhaps the r254548 change goes in the wrong direction here.
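
For anyone who wants to look at the generated code, a minimal stand-in test can be sketched as follows. The actual pr83203.c is not quoted in this comment, so this is an assumption: it uses GCC's generic vector extension and a 64-bit argument to match the reg/v:DI 88 [ x ] operand in the dumps, and it may or may not expand to exactly the same RTL as the original test case. The names v4di and foo are illustrative.

```c
/* Hypothetical stand-in for pr83203.c, not the original test case.
   A 256-bit vector of four 64-bit integers (GNU vector extension).  */
typedef long long v4di __attribute__ ((vector_size (32)));

/* Place x in the lowest element and zero the remaining three.  */
v4di
foo (long long x)
{
  return (v4di) { x, 0, 0, 0 };
}
```

Compiling this with -O2 -mavx2 -mtune=intel -S and again without -mtune=intel should allow comparing the direct vmovq sequence against the one that goes through the stack, assuming the stand-in triggers the same patterns.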
