On Sat, 24 Aug 2024 at 17:11, Roger Sayle <ro...@nextmovesoftware.com> wrote:
>
> This patch tweaks timode_scalar_chain::compute_convert_gain to better
> reflect the expansion of V1TImode arithmetic right shifts by the i386
> backend.  The comment "see ix86_expand_v1ti_ashiftrt" appears after
> "case ASHIFTRT" in compute_convert_gain, and the changes below attempt
> to better match the logic used there.
>
> The original motivating example is:
>
> __int128 m1;
> void foo()
> {
>   m1 = (m1 << 8) >> 8;
> }
>
> which with -O2 -mavx2 we fail to convert to vector form due to the
> inappropriate cost of the arithmetic right shift.
>
>   Instruction gain -16 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: -3
>   Chain #1 conversion is not profitable
>
> This is reporting that the ASHIFTRT is four instructions worse using
> vectors than in scalar form, which is incorrect as the AVX2 expansion
> of this shift only requires three instructions (and the scalar form
> requires two).
>
> With more accurate costs in timode_scalar_chain::compute_convert_gain
> we now see (with -O2 -mavx2):
>
>   Instruction gain -4 for 7: {r103:TI=r101:TI>>0x8;clobber flags:CC;}
>   Total gain: 9
>   Converting chain #1...
>
> which results in:
>
> foo:    vmovdqa m1(%rip), %xmm0
>         vpslldq $1, %xmm0, %xmm0
>         vpsrad  $8, %xmm0, %xmm1
>         vpsrldq $1, %xmm0, %xmm0
>         vpblendd        $7, %xmm0, %xmm1, %xmm0
>         vmovdqa %xmm0, m1(%rip)
>         ret
>
> This patch has been tested on x86_64-pc-linux-gnu with make bootstrap
> and make -k check, both with and without --target_board=unix{-m32}
> with no new failures.  No new testcase (yet) as the code for both the
> vector and scalar forms of the above function are still suboptimal
> so code generation is in flux, but this improvement should be a step
> in the right direction.  Ok for mainline?
>
>
> 2024-08-24  Roger Sayle  <ro...@nextmovesoftware.com>
>
> gcc/ChangeLog
>         * config/i386/i386-features.cc (compute_convert_gain)
>         <case ASHIFTRT>: Update to match ix86_expand_v1ti_ashiftrt.
>
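For readers following the quoted assembly, here is a minimal C sketch of why three vector instructions are enough for the 128-bit arithmetic right shift by 8. It is not taken from the patch or from ix86_expand_v1ti_ashiftrt. The type and helper names (v4si, srl_bytes1, sra_lanes8, blend_low3, check_vector_matches_scalar) are hypothetical, and the sketch assumes a little-endian target, GCC-style __int128, and arithmetic >> on negative signed values. It models only the vpsrldq/vpsrad/vpblendd steps, not the preceding vpslldq.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of an XMM register as four 32-bit lanes;
   lane[0] is the least significant (little-endian assumed).  */
typedef struct { uint32_t lane[4]; } v4si;

/* Scalar reference: arithmetic right shift of a 128-bit value by 8.
   Relies on GCC's arithmetic >> for negative signed values.  */
static __int128 scalar_ashiftrt8(__int128 x) { return x >> 8; }

/* Whole-register logical right shift by one byte (models vpsrldq $1).  */
static v4si srl_bytes1(v4si v) {
    v4si r;
    for (int i = 0; i < 4; i++) {
        uint32_t next = (i < 3) ? v.lane[i + 1] : 0;
        r.lane[i] = (v.lane[i] >> 8) | (next << 24);
    }
    return r;
}

/* Per-lane arithmetic right shift by 8 bits (models vpsrad $8).  */
static v4si sra_lanes8(v4si v) {
    v4si r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = (uint32_t)((int32_t)v.lane[i] >> 8);
    return r;
}

/* Low three lanes from a, top lane from b (models vpblendd $7).  */
static v4si blend_low3(v4si a, v4si b) {
    v4si r = a;
    r.lane[3] = b.lane[3];
    return r;
}

/* The three-step sequence: only the top lane needs the sign fill.  */
static __int128 vector_ashiftrt8(__int128 x) {
    v4si v, r;
    __int128 out;
    memcpy(&v, &x, sizeof v);
    r = blend_low3(srl_bytes1(v), sra_lanes8(v));
    memcpy(&out, &r, sizeof out);
    return out;
}

/* Compare the lane model against the scalar shift on a few values.  */
static void check_vector_matches_scalar(void) {
    __int128 big = (__int128)1 << 100;
    __int128 samples[] = { 0, 1, -1, 0x1234, big, -big,
                           -((__int128)1 << 120) };
    for (unsigned i = 0; i < sizeof samples / sizeof samples[0]; i++)
        assert(vector_ashiftrt8(samples[i]) == scalar_ashiftrt8(samples[i]));
}
```

Calling check_vector_matches_scalar() from a test driver exercises both positive and negative inputs; the sign only matters in the top lane, which is why a single blend suffices.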

TARGET_AVX2 always implies TARGET_SSE4_1, so there is no need to OR them
together.

OK with above change.

Thanks,
Uros.
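
The simplification requested in the review can be checked with a small C sketch. This is only an illustration: the exact condition in the patch is not shown in this message, so old_cond and new_cond are hypothetical stand-ins, and the flags are modelled as plain booleans rather than the real TARGET_* macros.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical form of a test that ORs the two ISA flags.  */
static bool old_cond(bool avx2, bool sse4_1) { return avx2 || sse4_1; }

/* Simplified form: test SSE4.1 alone.  */
static bool new_cond(bool avx2, bool sse4_1) { (void)avx2; return sse4_1; }

/* Count disagreements over every flag state that can actually occur,
   i.e. every state where AVX2 implies SSE4.1.  */
static int count_mismatches(void) {
    int mismatches = 0;
    for (int avx2 = 0; avx2 <= 1; avx2++)
        for (int sse4_1 = 0; sse4_1 <= 1; sse4_1++) {
            if (avx2 && !sse4_1)
                continue;  /* excluded: AVX2 without SSE4.1 */
            if (old_cond(avx2, sse4_1) != new_cond(avx2, sse4_1))
                mismatches++;
        }
    return mismatches;
}

/* The two forms agree on all reachable states.  */
static void check_flag_simplification(void) {
    assert(count_mismatches() == 0);
}
```

The only state where the two forms would differ is AVX2 enabled without SSE4.1, which the implication rules out.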