On Mon, Dec 18, 2023 at 6:18 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch fixes two issues with the handling of 128-bit TImode integer
> constants in the x86_64 backend.  The main issue is that GCC always
> tries to load 128-bit integer constants via broadcasts to vector SSE
> registers, even if the result is required in general registers.  This
> is seen in the two closely related functions below:
>
> __int128 m;
> #define CONST (((__int128)0x0123456789abcdefULL<<64) |
> 0x0123456789abcdefULL)
> void foo() { m &= CONST; }
> void bar() { m = CONST; }
>
> When compiled with -O2 -mavx, we currently generate:
>
> foo:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq     %xmm0, %xmm0, %xmm0
>         vmovq   %xmm0, %rax
>         vpextrq $1, %xmm0, %rdx
>         andq    %rax, m(%rip)
>         andq    %rdx, m+8(%rip)
>         ret
>
> bar:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm1
>         vpunpcklqdq     %xmm1, %xmm1, %xmm0
>         vpextrq $1, %xmm0, %rdx
>         vmovq   %xmm0, m(%rip)
>         movq    %rdx, m+8(%rip)
>         ret
>
> With this patch we defer the decision to use vector broadcast for
> TImode until we know we need actually want a SSE register result,
> by moving the call to ix86_convert_const_wide_int_to_broadcast from
> the RTL expansion pass, to the scalar-to-vector (STV) pass.  With
> this change (and a minor tweak described below) we now generate:
>
> foo:    movabsq $81985529216486895, %rax
>         andq    %rax, m(%rip)
>         andq    %rax, m+8(%rip)
>         ret
>
> bar:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq     %xmm0, %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> showing that we now correctly use vector mode broadcasts (only)
> where appropriate.
>
> The one minor tweak mentioned above is to enable the un-cprop hi/lo
> optimization, that I originally contributed back in September 2004
> https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html
> even when not optimizing for size.  Without this (and currently with
> just -O2) the function foo above generates:
>
> foo:    movabsq $81985529216486895, %rax
>         movabsq $81985529216486895, %rdx
>         andq    %rax, m(%rip)
>         andq    %rdx, m+8(%rip)
>         ret
>
> I'm not sure why (back in 2004) I thought that avoiding the implicit
> "movq %rax, %rdx" instead of a second load was faster, perhaps avoiding
> a dependency to allow better scheduling, but nowadays "movq %rax, %rdx"
> is either eliminated by GCC's hardreg cprop pass, or special cased by
> modern hardware, making the first foo preferrable, not only shorter but
> also faster.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> and with/without -march=cascadelake with no new failures.
> Ok for mainline?
>
>
> 2023-12-18  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.cc
>         (ix86_convert_const_wide_int_to_broadcast): Remove static.
>         (ix86_expand_move): Don't attempt to convert wide constants
>         to SSE using ix86_convert_const_wide_int_to_broadcast here.
>         (ix86_split_long_move): Always un-cprop multi-word constants.
>         * config/i386/i386-expand.h
>         (ix86_convert_const_wide_int_to_broadcast): Prototype here.
>         * config/i386/i386-features.cc: Include i386-expand.h.
>         (timode_scalar_chain::convert_insn): When converting TImode to
>         v1TImode, try ix86_convert_const_wide_int_to_broadcast.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/movti-2.c: New test case.
>         * gcc.target/i386/movti-3.c: Likewise.

OK.

Thanks,
Uros.

Reply via email to