On Mon, Dec 18, 2023 at 6:18 PM Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
>
> This patch fixes two issues with the handling of 128-bit TImode integer
> constants in the x86_64 backend.  The main issue is that GCC always
> tries to load 128-bit integer constants via broadcasts to vector SSE
> registers, even if the result is required in general registers.  This
> is seen in the two closely related functions below:
>
> __int128 m;
> #define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
> void foo() { m &= CONST; }
> void bar() { m = CONST; }
>
> When compiled with -O2 -mavx, we currently generate:
>
> foo:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq %xmm0, %xmm0, %xmm0
>         vmovq   %xmm0, %rax
>         vpextrq $1, %xmm0, %rdx
>         andq    %rax, m(%rip)
>         andq    %rdx, m+8(%rip)
>         ret
>
> bar:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm1
>         vpunpcklqdq %xmm1, %xmm1, %xmm0
>         vpextrq $1, %xmm0, %rdx
>         vmovq   %xmm0, m(%rip)
>         movq    %rdx, m+8(%rip)
>         ret
>
> With this patch we defer the decision to use a vector broadcast for
> TImode until we know we actually want an SSE register result, by
> moving the call to ix86_convert_const_wide_int_to_broadcast from the
> RTL expansion pass to the scalar-to-vector (STV) pass.  With this
> change (and a minor tweak described below) we now generate:
>
> foo:    movabsq $81985529216486895, %rax
>         andq    %rax, m(%rip)
>         andq    %rax, m+8(%rip)
>         ret
>
> bar:    movabsq $81985529216486895, %rax
>         vmovq   %rax, %xmm0
>         vpunpcklqdq %xmm0, %xmm0, %xmm0
>         vmovdqa %xmm0, m(%rip)
>         ret
>
> showing that we now correctly use vector broadcasts only where
> appropriate.
>
> The one minor tweak mentioned above is to enable the un-cprop hi/lo
> optimization, which I originally contributed back in September 2004
> (https://gcc.gnu.org/pipermail/gcc-patches/2004-September/148756.html),
> even when not optimizing for size.  Without this (and currently with
> just -O2) the function foo above generates:
>
> foo:    movabsq $81985529216486895, %rax
>         movabsq $81985529216486895, %rdx
>         andq    %rax, m(%rip)
>         andq    %rdx, m+8(%rip)
>         ret
>
> I'm not sure why (back in 2004) I thought that avoiding the implicit
> "movq %rax, %rdx" in favour of a second load was faster, perhaps to
> break a dependency and allow better scheduling, but nowadays
> "movq %rax, %rdx" is either eliminated by GCC's hardreg cprop pass or
> special-cased by modern hardware, making the first foo preferable:
> not only shorter but also faster.
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32},
> and with/without -march=cascadelake with no new failures.
> Ok for mainline?
>
>
> 2023-12-18  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-expand.cc
>         (ix86_convert_const_wide_int_to_broadcast): Remove static.
>         (ix86_expand_move): Don't attempt to convert wide constants
>         to SSE using ix86_convert_const_wide_int_to_broadcast here.
>         (ix86_split_long_move): Always un-cprop multi-word constants.
>         * config/i386/i386-expand.h
>         (ix86_convert_const_wide_int_to_broadcast): Prototype here.
>         * config/i386/i386-features.cc: Include i386-expand.h.
>         (timode_scalar_chain::convert_insn): When converting TImode to
>         V1TImode, try ix86_convert_const_wide_int_to_broadcast.
>
> gcc/testsuite/ChangeLog
>         * gcc.target/i386/movti-2.c: New test case.
>         * gcc.target/i386/movti-3.c: Likewise.
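>
> For reference, a minimal sketch of the kind of check the new movti
> tests could make (the directives below are illustrative assumptions,
> not necessarily the committed test contents):
>
>   /* { dg-do compile { target int128 } } */
>   /* { dg-options "-O2 -mavx" } */
>   __int128 m;
>   #define CONST (((__int128)0x0123456789abcdefULL<<64) | 0x0123456789abcdefULL)
>   void foo() { m &= CONST; }
>   /* { dg-final { scan-assembler-not "vpextrq" } } */
>   /* { dg-final { scan-assembler-times "movabsq" 1 } } */
>
> i.e. after the fix, foo must no longer bounce the constant through an
> SSE register, and a single movabsq load should suffice for both words.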
OK. Thanks, Uros.