[PATCH] Don't try bswap + rotate when TYPE_PRECISION(n->type) > n->range.

2023-06-01 Thread liuhongt via Gcc-patches
For the testcase in the PR, we have br64 = br; br64 = ((br64 << 16) & 0x00ffull) | (br64 & 0xff00ull); n->n: 0x300200. n->range: 32. n->type: uint64. The original code assumes n->range is the same as TYPE_PRECISION(n->type), and tries to rotate the mask from 0x30200

[PATCH] i386: Add missing vector truncate patterns [PR92658].

2023-06-01 Thread liuhongt via Gcc-patches
Add missing insn patterns for v2si -> v2hi/v2qi and v2hi-> v2qi vector truncate. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/92658 * config/i386/mmx.md (truncv2hiv2qi2): New define_insn. (truncv2si2): Ditto. gcc/testsu

[PATCH] [vect]Use intermiediate integer type for float_expr/fix_trunc_expr when direct optab is not existed.

2023-06-01 Thread liuhongt via Gcc-patches
We already use an intermediate type in case WIDEN, but not for NONE; this patch extends that. I didn't do that in pattern recog since we need to know whether the stmt belongs to any slp_node to decide the vectype; the related optabs are checked according to vectype_in and vectype_out. For non-sl

[PATCH] [x86] Add missing vec_pack/unpacks patterns for _Float16 <-> int/float conversion.

2023-06-04 Thread liuhongt via Gcc-patches
This patch only supports vec_pack/unpacks optabs for vector modes whose length >= 128. For 32/64-bit vectors, they're more often handled by the BB vectorizer with truncmn2/extendmn2/fix{,uns}_truncmn2. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: *

[PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-05 Thread liuhongt via Gcc-patches
r14-1145 folds the intrinsics into gimple ABS_EXPR, which has UB for TYPE_MIN, but PABSB stores an unsigned result into dst. The patch uses ABSU_EXPR + VCE instead of ABS_EXPR. Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE. B
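
The difference between a signed and an unsigned absolute value can be illustrated with a scalar model of a single byte lane. This is only a sketch: absu8 is a hypothetical helper name, not a GCC or intrinsic function, and it models one lane rather than the gimple folding itself.

```c
#include <stdint.h>

/* Hypothetical scalar model of one PABSB byte lane.  The result is kept
   as an unsigned byte, so the input -128 maps to 128 instead of
   overflowing the signed 8-bit range. */
static uint8_t absu8(int8_t x)
{
    /* Widen to int first so that negating -128 is well defined. */
    int wide = x;
    return (uint8_t)(wide < 0 ? -wide : wide);
}
```

For the minimum value, absu8(-128) gives 128, which has no representation as a signed 8-bit result; that is the case where a signed absolute value would be undefined.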

[PATCH] Don't fold _mm{, 256}_blendv_epi8 into (mask < 0 ? src1 : src2) when -funsigned-char.

2023-06-05 Thread liuhongt via Gcc-patches
mask < 0 is always false under -funsigned-char, but vpblendvb needs to check the most significant bit. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk and backport to GCC12/GCC13 release branch? gcc/ChangeLog: PR target/110108 * config/i386/i386-b
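
A scalar sketch of one byte lane may make the problem easier to see. The function and parameter names below are hypothetical, and the operand order of the real builtin is deliberately not modelled; the only point is that a sign test on an unsigned element type never fires, while a test of bit 7 does.

```c
#include <stdint.h>

/* Select by the most significant bit of the mask byte, which is the
   per-lane test vpblendvb performs. */
static uint8_t blend_by_msb(uint8_t if_clear, uint8_t if_set, uint8_t mask)
{
    return (mask & 0x80) ? if_set : if_clear;
}

/* The same selection written as a sign test.  It is only equivalent
   because the mask is first reinterpreted as a signed byte (GCC defines
   this out-of-range conversion as wrapping modulo 256). */
static uint8_t blend_by_sign(uint8_t if_clear, uint8_t if_set, uint8_t mask)
{
    int8_t smask = (int8_t)mask;
    return (smask < 0) ? if_set : if_clear;
}

/* What the comparison computes when the element type is unsigned:
   the condition is never true, so if_set is never chosen. */
static uint8_t blend_unsigned_compare(uint8_t if_clear, uint8_t if_set,
                                      uint8_t mask)
{
    return (mask < 0) ? if_set : if_clear;
}
```

With mask 0xff, the first two helpers return if_set and the third returns if_clear.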

[PATCH v2] Explicitly view_convert_expr mask to signed type when folding pblendvb builtins.

2023-06-06 Thread liuhongt via Gcc-patches
> I think this is a better patch and will always be correct and still > get folded at the gimple level (correctly): > diff --git a/gcc/config/i386/i386.cc b/gcc/config/i386/i386.cc > index d4ff56ee8dd..02bf5ba93a5 100644 > --- a/gcc/config/i386/i386.cc > +++ b/gcc/config/i386/i386.cc > @@ -18561,8

[PATCH 1/2] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABSU_EXPR + VCE.

2023-06-06 Thread liuhongt via Gcc-patches
r14-1145 folds the intrinsics into gimple ABS_EXPR, which has UB for TYPE_MIN, but PABSB stores an unsigned result into dst. The patch uses ABSU_EXPR + VCE instead of ABS_EXPR. Also don't fold _mm_abs_{pi8,pi16,pi32} w/o TARGET_64BIT since 64-bit vector absm2 is guarded with TARGET_MMX_WITH_SSE. g

[PATCH] [x86] Use x instead of v for alternative 2 (v, BH) in mov_internal.

2023-06-13 Thread liuhongt via Gcc-patches
Since there's no evex version for vpcmpeq ymm, ymm, ymm. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk and backport to GCC13. gcc/ChangeLog: PR target/110227 * config/i386/sse.md (mov_internal>): Use x instead of v for alternative 2 sinc

[PATCH 1/2] Reimplement packuswb/packusdw with UNSPEC_US_TRUNCATE instead of original us_truncate.

2023-06-15 Thread liuhongt via Gcc-patches
packuswb/packusdw do unsigned saturation for a signed source, but rtl us_truncate means unsigned saturation for an unsigned source. So for value -1, packuswb will produce 0, but us_truncate produces 255. The patch reimplements those related patterns and functions with UNSPEC_US_TRUNCATE instead of
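
The two behaviours can be compared on a single 16-bit lane. This is a minimal scalar sketch; packus_lane and us_truncate_lane are hypothetical names for the two semantics described above, not functions from GCC.

```c
#include <stdint.h>

/* packuswb-style lane: the source is signed, the result saturates to
   the unsigned 8-bit range [0, 255]. */
static uint8_t packus_lane(int16_t x)
{
    if (x < 0)
        return 0;
    if (x > 255)
        return 255;
    return (uint8_t)x;
}

/* us_truncate-style lane: the source is treated as unsigned, so only
   the upper bound can saturate. */
static uint8_t us_truncate_lane(uint16_t x)
{
    return x > 255 ? 255 : (uint8_t)x;
}
```

Feeding the bit pattern of -1 to both shows the mismatch: packus_lane(-1) yields 0, while us_truncate_lane((uint16_t)-1) sees 65535 and yields 255.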

[PATCH 2/2] Refined 256/512-bit vpacksswb/vpackssdw patterns.

2023-06-15 Thread liuhongt via Gcc-patches
The packing in vpacksswb/vpackssdw is not a simple concat, it's an interweave from src1 and src2 for every 128 bits (or 64 bits for the ss_truncate result), i.e. dst[192-255] = ss_truncate (src2[128-255]) dst[128-191] = ss_truncate (src1[128-255]) dst[64-127] = ss_truncate (src2[0-127]) dst[0-63] =
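
The per-lane layout can be written out as a scalar model of the 256-bit dword-to-word pack. This is a sketch with hypothetical helper names; the lowest row (dst[0-63] taken from src1[0-127]) is assumed from the pattern of the other three rows, since the snippet is cut off at that point.

```c
#include <stdint.h>

/* Signed saturation of a 32-bit value to 16 bits. */
static int16_t ss_truncate32(int32_t x)
{
    if (x > INT16_MAX)
        return INT16_MAX;
    if (x < INT16_MIN)
        return INT16_MIN;
    return (int16_t)x;
}

/* Scalar model of a 256-bit vpackssdw: each 128-bit lane of the result
   holds four words from src1 followed by four words from src2, both
   taken from the same 128-bit lane of the inputs. */
static void packssdw256_model(int16_t dst[16], const int32_t src1[8],
                              const int32_t src2[8])
{
    for (int lane = 0; lane < 2; lane++) {
        for (int i = 0; i < 4; i++) {
            dst[lane * 8 + i] = ss_truncate32(src1[lane * 4 + i]);
            dst[lane * 8 + 4 + i] = ss_truncate32(src2[lane * 4 + i]);
        }
    }
}
```

A plain concatenation would instead place all eight src1 elements in the low half of dst, which is why the pattern cannot be described as a simple concat of two truncations.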

[PATCH] Remove # from one_cmpl2 assemble output.

2023-07-17 Thread liuhongt via Gcc-patches
optimize_insn_for_speed () in assemble output is not aligned with the splitter condition, and it causes an ICE when building SPEC2017 blender_r. Not sure if ctrl is supposed to be reliable in assemble output; the patch just removes that as a workaround. Bootstrapped and regtested on x86_64-pc-linux-gnu

[PATCH] Fix fp16 related testcase failure for i686.

2023-07-19 Thread liuhongt via Gcc-patches
> I see some regressions most likely with this change on i686-linux, > in particular: > +FAIL: gcc.dg/pr107547.c (test for excess errors) > +FAIL: gcc.dg/torture/floatn-convert.c -O0 (test for excess errors) > +UNRESOLVED: gcc.dg/torture/floatn-convert.c -O0 compilation failed to > produce execu

[PATCH] Optimize vlddqu to vmovdqu for TARGET_AVX

2023-07-20 Thread liuhongt via Gcc-patches
For Intel processors, after TARGET_AVX, vmovdqu is optimized as fast as vlddqu, UNSPEC_LDDQU can be removed to enable more optimizations. Can someone confirm this with AMD folks? If AMD doesn't like such optimization, I'll put my optimization under micro-architecture tuning. Bootstrapped and regte

[PATCH] [x86] Add UNSPEC_MASKOP to vpbroadcastm pattern.

2023-07-27 Thread liuhongt via Gcc-patches
Prevent rtl optimization of vec_duplicate + zero_extend to vpbroadcastm since there could be an extra kmov after RA. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ready to push to trunk. gcc/ChangeLog: PR target/110788 * config/i386/sse.md (avx512cd_maskb_vec_dup): Add

[PATCH] Adjust testcase for more optimal codegen.

2023-07-31 Thread liuhongt via Gcc-patches
After b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb is the first bad commit commit b9d7140c80bd3c7355b8291bb46f0895dcd8c3cb Author: Jan Hubicka Date: Fri Jul 28 09:16:09 2023 +0200 loop-split improvements, part 1 Now we have vpbroadcastd %ecx, %xmm0 vpaddd .LC3(%rip), %xmm0, %xmm0 v

[PATCH] Support vec_fmaddsub/vec_fmsubadd for vector HFmode.

2023-08-01 Thread liuhongt via Gcc-patches
AVX512FP16 supports vfmaddsubXXXph and vfmsubaddXXXph. Also remove scalar mode from fmaddsub/fmsubadd pattern since there's no scalar instruction for that. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/81904 * config/i3

[PATCH] Optimize vlddqu + inserti128 to vbroadcasti128

2023-08-01 Thread liuhongt via Gcc-patches
In [1], I proposed a patch to generate vmovdqu for all vlddqu intrinsics after AVX2; it was rejected as > The instruction is reachable only as __builtin_ia32_lddqu* (aka > _mm_lddqu_si*), so it was chosen by the programmer for a reason. I > think that in this case, the compiler should not be too smart

[PATCH] Allocate general register(memory/immediate) for 16/32/64-bit vector bit_op patterns.

2022-07-10 Thread liuhongt via Gcc-patches
And split it to GPR-version instruction after reload. This will enable below optimization for 16/32/64-bit vector bit_op - movd (%rdi), %xmm0 - movd (%rsi), %xmm1 - pand %xmm1, %xmm0 - movd %xmm0, (%rdi) + movl (%rsi), %eax + andl %eax, (%rdi)

[PATCH] [RFC]Support vectorization for Complex type.

2022-07-10 Thread liuhongt via Gcc-patches
The patch only handles load/store (including ctor/permutation, except gather/scatter) for complex type; other operations don't need to be handled since they will be lowered by pass cplxlower. (MASK_LOAD is not supported for complex type, so no need to handle it either). Instead of support vector(2) _C

[PATCH] Extend 64-bit vector bit_op patterns with ?r alternative

2022-07-13 Thread liuhongt via Gcc-patches
And split it to GPR-version instruction after reload. > ?r was introduced under the assumption that we want vector values > mostly in vector registers. Currently there are no instructions with > memory or immediate operand, so that made sense at the time. Let's > keep ?r until logic instructions w

[PATCH] Extend 16/32-bit vector bit_op patterns with (m, 0, i)(vertical) alternative.

2022-07-17 Thread liuhongt via Gcc-patches
And split it after reload. >IMO, the only case it is worth adding is a direct immediate store to >memory, which HJ recently added. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/106038 * config/i386/mmx.md (3): Extend to AND mem,

[PATCH V2] [RFC]Support vectorization for Complex type.

2022-07-17 Thread liuhongt via Gcc-patches
V2 update: Handle VMAT_ELEMENTWISE, VMAT_CONTIGUOUS_PERMUTE, VMAT_STRIDED_SLP, VMAT_CONTIGUOUS_REVERSE, VMAT_CONTIGUOUS_DOWN for complex type. I've run SPECspeed@2017 627.cam4_s; there are some vectorization cases, but no big performance impact (since this patch only handles load/store). Any co

[PATCH V2] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-18 Thread liuhongt via Gcc-patches
And split it after reload. > You will need ix86_binary_operator_ok insn constraint here with > corresponding expander using ix86_fixup_binary_operands_no_copy to > prepare insn operands. Split define_expand with just register_operand, and allow memory/immediate in define_insn, assume combine/forwp

[PATCH] Move pass_cse_sincos after vectorizer.

2022-07-19 Thread liuhongt via Gcc-patches
__builtin_cexpi can't be vectorized since there's a gap between it and the vectorized sincos version (in libmvec, it takes a double and two double pointers and returns nothing). And it will lose some vectorization opportunities if sin & cos are optimized to cexpi before the vectorizer. I'm trying to add vect_r

gcc-patches@gcc.gnu.org

2022-07-19 Thread liuhongt via Gcc-patches
> My original comments still stand (it feels like this should be more generic). > Can we go the way lowering complex loads/stores first?  A large part > of the testcases > added by the patch should pass after that. This is the patch as suggested, one additional change is handling COMPLEX_CST for r

[PATCH V3] Extend 16/32-bit vector bit_op patterns with (m, 0, i) alternative.

2022-07-20 Thread liuhongt via Gcc-patches
And split it after reload. gcc/ChangeLog: PR target/106038 * config/i386/mmx.md (3): New define_expand, it's original "3". (*3): New define_insn, it's original "3" be extended to handle memory and immediate operand with ix86_binary_operator_ok. Also

[PATCH] Adjust testcase.

2022-07-21 Thread liuhongt via Gcc-patches
r13-1762-gf9d4c3b45c5ed5f45c8089c990dbd4e181929c3d lowers complex type moves to scalars, but testcase pr23911 is supposed to scan for a __complex__ constant which is never available, so adjust the testcase to scan IMAGPART/REALPART_EXPR constants separately. Pushed as obvious patch. gcc/testsuite/ChangeLog

[RFC: PATCH] Extend vectorizer to handle nonlinear induction for neg, mul/lshift/rshift with a constant.

2022-08-03 Thread liuhongt via Gcc-patches
For neg, the patch creates a vec_init as [ a, -a, a, -a, ... ] and no vec_step is needed to update the vectorized iv since vf is always a multiple of 2 (negative * negative is positive). For shift, the patch creates a vec_init as [ a, a >> c, a >> 2*c, ..] and vec_step as [ c * nunits, c * nunits, c * nuni
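
The neg case can be sketched in plain C by comparing a scalar loop with a blocked version that reuses one fixed initial vector. This is only an illustration under assumptions: VF is fixed at 4, the function names are hypothetical, and the blocked loop stands in for what the vectorizer would emit rather than reproducing it.

```c
#include <stddef.h>

enum { VF = 4 }; /* assumed vectorization factor, a multiple of 2 */

/* Scalar form of the induction: the iv flips sign every iteration. */
static void neg_iv_scalar(int *out, size_t n, int a)
{
    int iv = a;
    for (size_t i = 0; i < n; i++) {
        out[i] = iv;
        iv = -iv;
    }
}

/* Blocked form: vec_init = [a, -a, a, -a] is stored unchanged for every
   block.  No step is needed because advancing by an even VF negates the
   iv an even number of times. */
static void neg_iv_blocked(int *out, size_t n, int a)
{
    const int vec_init[VF] = { a, -a, a, -a };
    size_t i = 0;
    for (; i + VF <= n; i += VF)
        for (size_t k = 0; k < VF; k++)
            out[i + k] = vec_init[k];
    /* Scalar epilogue for the remaining elements. */
    for (; i < n; i++)
        out[i] = (i % 2 == 0) ? a : -a;
}
```

Both functions produce the same sequence a, -a, a, -a, ... for any length n.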

[PATCH] Fix ICE in rtl check when bootstrap.

2023-08-07 Thread liuhongt via Gcc-patches
/var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c: In function ‘matmul_i1_avx512f’: /var/tmp/portage/sys-devel/gcc-14.0.0_pre20230806/work/gcc-14-20230806/libgfortran/generated/matmul_i1.c:1781:1: internal compiler error: RTL check: expected

[PATCH] i386: Clear upper bits of XMM register for V4HFmode/V2HFmode operations [PR110762]

2023-08-07 Thread liuhongt via Gcc-patches
Similar to r14-2786-gade30fad6669e5, the patch is for V4HF/V2HFmode. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/110762 * config/i386/mmx.md (3): Changed from define_insn to define_expand and break into .. (v4

[PATCH] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread liuhongt via Gcc-patches
Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported via EAX. Intel documentation says invalid subleaves return 0. We had been relying on that behavior instead of checking the max subleaf number. It appears that some Sandy Bridge CPUs return at least the subleaf 0 EDX value for subl
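
The guard can be sketched independently of real hardware by passing the CPUID query in as a callback. Everything here is an assumption made for illustration: struct cpuid_regs, the cpuid_fn type, query_leaf7_subleaf1 and the fake CPU are hypothetical names, and the code is not the change made to cpuinfo.h.

```c
#include <stdint.h>

struct cpuid_regs {
    uint32_t eax, ebx, ecx, edx;
};

/* Hypothetical callback standing in for the CPUID instruction. */
typedef struct cpuid_regs (*cpuid_fn)(uint32_t leaf, uint32_t subleaf);

/* Read leaf 7 subleaf 1 only if subleaf 0 reports, in EAX, that the
   maximum supported subleaf is at least 1; otherwise return zeros. */
static struct cpuid_regs query_leaf7_subleaf1(cpuid_fn cpuid)
{
    struct cpuid_regs zero = { 0, 0, 0, 0 };
    uint32_t max_subleaf_level = cpuid(7, 0).eax;
    if (max_subleaf_level < 1)
        return zero;
    return cpuid(7, 1);
}

/* Fake CPU modelling the reported misbehaviour: it supports only
   subleaf 0, yet echoes the subleaf 0 EDX value for subleaf 1. */
static struct cpuid_regs fake_buggy_cpuid(uint32_t leaf, uint32_t subleaf)
{
    struct cpuid_regs r = { 0, 0, 0, 0 };
    (void)subleaf; /* same answer for every subleaf */
    if (leaf == 7)
        r.edx = 0x10; /* arbitrary feature bit, chosen for the example */
    return r;
}
```

With the guard in place, query_leaf7_subleaf1(fake_buggy_cpuid) returns all zeros even though a direct fake_buggy_cpuid(7, 1) call would report a nonzero EDX.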

[PATCH V2] [X86] Workaround possible CPUID bug in Sandy Bridge.

2023-08-08 Thread liuhongt via Gcc-patches
> Please rather do it in a more self-descriptive way, as proposed in the > attached patch. You won't need a comment then. > Adjusted in V2 patch. Don't access leaf 7 subleaf 1 unless subleaf 0 says it is supported via EAX. Intel documentation says invalid subleaves return 0. We had been relying

[PATCH] Rename local variable subleaf_level to max_subleaf_level.

2023-08-08 Thread liuhongt via Gcc-patches
This minor fix is preapproved in [1]. Committed to trunk. [1] https://gcc.gnu.org/pipermail/gcc-patches/2023-August/626758.html gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_available_features): Rename local variable subleaf_level to max_subleaf_level. --- gcc/common/config

[PATCH] i386: Do not sanitize upper part of V2HFmode and V4HFmode reg with -fno-trapping-math [PR110832]

2023-08-09 Thread liuhongt via Gcc-patches
Also add ix86_partial_vec_fp_math to the condition of V2HF/V4HF named patterns in order to avoid generation of partial vector V8HFmode trapping instructions. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR target/110832 * config/i386/mmx.md

[PATCH] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions.

2023-08-09 Thread liuhongt via Gcc-patches
Currently we have 3 different independent tunes for gather, "use_gather,use_gather_2parts,use_gather_4parts"; similarly for scatter, there are "use_scatter,use_scatter_2parts,use_scatter_4parts". The patch supports 2 standardizing options to enable/disable vectorization for all gather/scatter instructio

[PATCH] Software mitigation: Disable gather generation in vectorization for GDS affected Intel Processors.

2023-08-10 Thread liuhongt via Gcc-patches
For more details of GDS (Gather Data Sampling), refer to https://www.intel.com/content/www/us/en/developer/articles/technical/software-security-guidance/advisory-guidance/gather-data-sampling.html After the microcode update, there's a performance regression. To avoid that, the patch disables gather gene

[PATCH V2] Support -m[no-]gather -m[no-]scatter to enable/disable vectorization for all gather/scatter instructions

2023-08-10 Thread liuhongt via Gcc-patches
Rename original use_gather to use_gather_8parts, Support -mtune-ctrl={,^}use_gather to set/clear tune features use_gather_{2parts, 4parts, 8parts}. Support the new option -mgather as alias of -mtune-ctrl=, use_gather, ^use_gather. Similar for use_scatter. How about this version? gcc/ChangeLog:

[PATCH] Generate vmovapd instead of vmovsd for moving DFmode between SSE_REGS.

2023-08-13 Thread liuhongt via Gcc-patches
vmovapd can enable register renaming and has the same code size as vmovsd. Similarly for vmovsh vs vmovaps: vmovaps is 1 byte shorter than vmovsh. When TARGET_AVX512VL is not available, still generate vmovsd/vmovss/vmovsh to avoid vmovapd/vmovaps zmm16-31. Bootstrapped and regtested on x86_64-pc-linux-gn

[PATCH] Support -march=gracemont

2023-08-17 Thread liuhongt via Gcc-patches
Alderlake-N is E-core only, add it as an alias of Alderlake. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Any comments? gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_intel_cpu): Detect Alderlake-N. * common/config/i386/i386-common.cc (alias_table): Suppo

[PATCH] Mention Intel -march=gracemont for Alderlake-N.

2023-08-20 Thread liuhongt via Gcc-patches
--- htdocs/gcc-14/changes.html | 4 1 file changed, 4 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index eae25f1a..2c888660 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/changes.html @@ -151,6 +151,10 @@ a work-in-progress. -march=luna

[PATCH] Adjust testcase for Intel GDS.

2023-08-21 Thread liuhongt via Gcc-patches
gcc/testsuite/ChangeLog: * gcc.target/i386/avx512f-pr88464-2.c: Add -mgather to options. * gcc.target/i386/avx512f-pr88464-3.c: Ditto. * gcc.target/i386/avx512f-pr88464-4.c: Ditto. * gcc.target/i386/avx512f-pr88464-6.c: Ditto. * gcc.target/i386/avx51

[PATCH] [x86] Testcase fix.

2023-08-21 Thread liuhongt via Gcc-patches
Commit as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/invariant-ternlog-1.c: Only scan %rdx under TARGET_64BIT. --- gcc/testsuite/gcc.target/i386/invariant-ternlog-1.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/gcc/testsuite/gcc.target/i386

[PATCH] [vect]Use intermiediate integer type for float_expr/fix_trunc_expr when direct optab is not existed.

2023-06-20 Thread liuhongt via Gcc-patches
I notice there's been some refactoring in vectorizable_conversion for code_helper, so I've adjusted my patch to that. Here's the patch I'm going to commit. We already use an intermediate type in case WIDEN, but not for NONE; this patch extends that. gcc/ChangeLog: PR target/110018 * tre

[PATCH] Refine maskloadmn pattern with UNSPEC_MASKLOAD.

2023-06-20 Thread liuhongt via Gcc-patches
If mem_addr points to a memory region with less than whole vector size bytes of accessible memory and k is a mask that would prevent reading the inaccessible bytes from mem_addr, add UNSPEC_MASKLOAD to prevent it from being transformed to vpblendd. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32

[PATCH 2/3] Don't use intermiediate type for FIX_TRUNC_EXPR when ftrapping-math.

2023-06-25 Thread liuhongt via Gcc-patches
> > Hmm, good question. GENERIC has a direct truncation to unsigned char > > for example, the C standard generally says if the integral part cannot > > be represented then the behavior is undefined. So I think we should be > > safe here (0x1.0p32 doesn't fit an int). > > We should be following An

[PATCH 3/3] [aarch64] Adjust testcase to match assembly output after r14-2007.

2023-06-25 Thread liuhongt via Gcc-patches
The new assembly looks better than original one, so I adjust those testcases. Ok for trunk? gcc/testsuite/ChangeLog: PR tree-optimization/110371 PR tree-optimization/110018 * gcc.target/aarch64/sve/unpack_fcvt_signed_1.c: Scan scvt + sxtw instead of scvt + zip1 + z

[PATCH 1/3] Use cvt_op to save intermediate type operand instead of "subtle" vec_dest.

2023-06-25 Thread liuhongt via Gcc-patches
When there are multiple operands in vec_oprnds0, vec_dest will be overwritten to vectype_out, but in the multi_step_cvt case, cvt_type is expected. It caused an ICE in verify_gimple_in_cfg. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and aarch64-linux-gnu. Ok for trunk? gcc/ChangeLog:

[PATCH] Issue a warning for conversion between short and __bf16 under TARGET_AVX512BF16.

2023-06-26 Thread liuhongt via Gcc-patches
__bfloat16 is redefined from typedef short to real __bf16 since GCC V13. The patch issues a warning for potential silent implicit conversion between __bf16 and short where users may only expect a data movement. To avoid too many false positives, the warning is only issued under TARGET_AVX512BF16. Bootstrapp

[PATCH] [x86] Refine maskstore patterns with UNSPEC_MASKMOV.

2023-06-26 Thread liuhongt via Gcc-patches
At the rtl level, we cannot guarantee that the maskstore is not optimized to other full-memory accesses, as the current implementations are equivalent in terms of pattern. To solve this potential problem, this patch refines the pattern of the maskstore and the intrinsics with unspec. One thing I'm

[PATCH 2/2] Make option mvzeroupper independent of optimization level.

2023-06-26 Thread liuhongt via Gcc-patches
pass_insert_vzeroupper is under condition TARGET_AVX && TARGET_VZEROUPPER && flag_expensive_optimizations && !optimize_size. But the documentation of mvzeroupper doesn't mention that the insertion requires -O2 and above, which may confuse users when they explicitly use -Os -mvzeroupper. mvzeroupp

[PATCH 1/2] Don't issue vzeroupper for vzeroupper call_insn.

2023-06-26 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/82735 * config/i386/i386.cc (ix86_avx_u127_mode_needed): Don't emit vzeroupper for vzeroupper call_insn. gcc/testsuite/ChangeLog: * gcc.target/i386/avx-vzeroupper-30.

[PATCH] Break false dependence for vpternlog by inserting vpxor.

2023-07-03 Thread liuhongt via Gcc-patches
vpternlog is also used for optimization which doesn't need any valid input operand, in that case, the destination is used as input in the instruction and that creates a false dependence. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR t

[PATCH] Disparage slightly for the alternative which move DFmode between SSE_REGS and GENERAL_REGS.

2023-07-05 Thread liuhongt via Gcc-patches
For testcase void __cond_swap(double* __x, double* __y) { bool __r = (*__x < *__y); auto __tmp = __r ? *__x : *__y; *__y = __r ? *__y : *__x; *__x = __tmp; } GCC-14 with -O2 and -march=x86-64 options generates the following code: __cond_swap(double*, double*): movsd xmm1, QWORD

[PATCH 2/2] Adjust rtx_cost for DF/SFmode AND/IOR/XOR/ANDN operations.

2023-07-05 Thread liuhongt via Gcc-patches
They should have the same cost as vector mode since both generate pand/pandn/pxor/por instruction. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/i386.cc (ix86_rtx_costs): Adjust rtx_cost for DF/SFmode AND/IOR/XOR/ANDN operations.

[PATCH 1/2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-05 Thread liuhongt via Gcc-patches
We have ix86_expand_sse_fp_minmax to detect min/max semantics, but it requires rtx_equal_p for cmp_op0/cmp_op1 and if_true/if_false. For the testcase in the PR, there's an extra move from cmp_op0 to if_true, and it failed ix86_expand_sse_fp_minmax. This patch adds a pre_reload splitter to detect the

[PATCH V2] [x86] Add pre_reload splitter to detect fp min/max pattern.

2023-07-06 Thread liuhongt via Gcc-patches
> Please split the above pattern into two, one emitting UNSPEC_IEEE_MAX > and the other emitting UNSPEC_IEEE_MIN. Split. > The test involves blendv instruction, which is SSE4.1, so it is > pointless to test it without -msse4.1. Please add -msse4.1 instead of > -march=x86_64 and use sse4_runtime

[PATCH] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-09 Thread liuhongt via Gcc-patches
False dependency happens when destination is only updated by pternlog. There is no false dependency when destination is also used in source. So either a pxor should be inserted, or input operand should be set with constraint '0'. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to p

[PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-10 Thread liuhongt via Gcc-patches
Similar to what we did for cmpxchg, but extended to all ix86_comparison_int_operator since cmpccxadd sets EFLAGS exactly the same as CMP. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, Ok for trunk? gcc/ChangeLog: PR target/110591 * config/i386/sync.md (cmpccxadd_): Add a new

[PATCH v2] Break false dependence for vpternlog by inserting vpxor or setting constraint of input operand to '0'

2023-07-10 Thread liuhongt via Gcc-patches
Here's updated patch. 1. use optimize_insn_for_speed_p instead of using optimize_function_for_speed_p. 2. explicitly move memory to dest register to avoid false dependence in one_cmpl pattern. False dependency happens when destination is only updated by pternlog. There is no false dependency whe

[PATCH] Add peephole to eliminate redundant comparison after cmpccxadd.

2023-07-11 Thread liuhongt via Gcc-patches
Similar to what we did for CMPXCHG, but extended to all ix86_comparison_int_operator since CMPCCXADD sets EFLAGS exactly the same as CMP. When the operand order in the CMP insn is the same as that in CMPCCXADD, the CMP insn can be eliminated directly. When the operand order is swapped in the CMP insn, only optimize cmpccxadd +

[PATCH] Fix typo in the testcase.

2023-07-11 Thread liuhongt via Gcc-patches
Antony Polukhin 2023-07-11 09:51:58 UTC There's a typo at https://gcc.gnu.org/git/?p=gcc.git;a=blob;f=gcc/testsuite/g%2B%2B.target/i386/pr110170.C;h=e638b12a5ee2264ecef77acca86432a9f24b103b;hb=d41a57c46df6f8f7dae0c0a8b349e734806a837b#l87 It should be `|| !test3() || !test3r()` rather than `|| !te

[PATCH] x86: Add a new option -mdaz-ftz to enable FTZ and DAZ flags in MXCSR.

2023-05-10 Thread liuhongt via Gcc-patches
> The quoted patch shows -shared in context and you didn't post a > backport version > to look at. But yes, we shouldn't change -shared behavior on a > branch, even less so make it > inconsistent between targets. Here's the patch. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for

[PATCH] Provide -fcf-protection=branch,return.

2023-05-11 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/89701 * common.opt: Refactor -fcf-protection= to support combination of param. * lto-wrapper.c (merge_and_complain): Adjusted. * opts.c (parse_cf_protection_opt

[PATCH V2] Provide -fcf-protection=branch,return.

2023-05-13 Thread liuhongt via Gcc-patches
> I think this could be simplified if you use either EnumSet or > EnumBitSet instead in common.opt for `-fcf-protection=`. Use EnumSet instead of EnumBitSet since CF_FULL is not power of 2. It is a bit tricky for sets classification, cf_branch and cf_return should be in different sets, but they bo

[PATCH] Only use NO_REGS in cost calculation when !hard_regno_mode_ok for GENERAL_REGS and mode.

2023-05-16 Thread liuhongt via Gcc-patches
r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in cost calculation when the preferred register class are not known yet. It regressed powerpc PR109610 and PR109858, it looks too aggressive to use NO_REGS when mode can be allocated with GENERAL_REGS. The patch takes a step back, still use

[PATCH] Fold _mm{, 256, 512}_abs_{epi8, epi16, epi32, epi64} into gimple ABS_EXPR.

2023-05-22 Thread liuhongt via Gcc-patches
Also for 64-bit vector abs intrinsics _mm_abs_{pi8,pi16,pi32}. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/109900 * config/i386/i386.cc (ix86_gimple_fold_builtin): Fold _mm{,256,512}_abs_{epi8,epi16,epi32,epi64} and

[PATCH] [x86] Split notl + pbraodcast + pand to pbroadcast + pandn more modes.

2023-05-25 Thread liuhongt via Gcc-patches
r12-5595-gc39d77f252e895306ef88c1efb3eff04e4232554 adds 2 splitters to transform notl + pbroadcast + pand to pbroadcast + pandn for VI124_AVX2, which leaves out all DI-element-size ones as well as all 512-bit ones. This patch extends the splitter to VI_AVX2 which will handle DImode for AVX2, and V64QI

[PATCH] Disable avoid_false_dep_for_bmi for atom and icelake(and later) core processors.

2023-05-25 Thread liuhongt via Gcc-patches
lzcnt/tzcnt has been fixed since skylake, popcnt has been fixed since icelake. At least for icelake and later intel Core processors, the errata tune is not needed. And the tune isn't needed for ATOM either. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/Chang

[PATCH] Support cond_add/sub/mul/div for vector float/double.

2021-08-01 Thread liuhongt via Gcc-patches
Hi: This patch supports cond_add/sub/mul/div expanders for vector float/double. There are still cond_fma/fms/fnms/fma/max/min/xor/ior/and left, for which I failed to figure out a testcase to validate them. Also cond_add/sub/mul for vector integer. Bootstrap is ok, survives the regression test on

[PATCH 2/6] [i386] Enable _Float16 type for TARGET_SSE2 and above.

2021-08-01 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/i386-modes.def (FLOAT_MODE): Define ieee HFmode. * config/i386/i386.c (enum x86_64_reg_class): Add X86_64_SSEHF_CLASS. (merge_classes): Handle X86_64_SSEHF_CLASS. (examine_argument): Ditto. (construct_container): Ditto.

[PATCH V3 0/6] Initial support for AVX512FP16

2021-08-01 Thread liuhongt via Gcc-patches
Update from v2: 1. Support -fexcess-precision=16 which will enable FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16. 2. Update ix86_get_excess_precision, so -fexcess-precision=standard should not do anything different from -fexcess-precision=fast regarding _Float16. 3. Avoiding

[PATCH 4/6] Support -fexcess-precision=16 which will enable FLT_EVAL_METHOD_PROMOTE_TO_FLOAT16 when backend supports _Float16.

2021-08-01 Thread liuhongt via Gcc-patches
gcc/ada/ChangeLog: * gcc-interface/misc.c (gnat_post_options): Issue an error for -fexcess-precision=16. gcc/c-family/ChangeLog: * c-common.c (excess_precision_mode_join): Update below comments. (c_ts18661_flt_eval_method): Set excess_precision_type to EXC

[PATCH 1/6] Update hf soft-fp from glibc.

2021-08-01 Thread liuhongt via Gcc-patches
libgcc/ChangeLog * soft-fp/eqhf2.c: New file. * soft-fp/extendhfdf2.c: New file. * soft-fp/extendhfsf2.c: New file. * soft-fp/extendhfxf2.c: New file. * soft-fp/half.h (FP_CMP_EQ_H): New macro. * soft-fp/truncdfhf2.c: New file * soft-fp/trunc

[PATCH 3/6] [i386] libgcc: Enable hfmode soft-sf/df/xf/tf extensions and truncations.

2021-08-01 Thread liuhongt via Gcc-patches
libgcc/ChangeLog: * config/i386/32/sfp-machine.h (_FP_NANFRAC_H): New macro. * config/i386/64/sfp-machine.h (_FP_NANFRAC_H): Ditto. * config/i386/sfp-machine.h (_FP_NANSIGN_H): Ditto. * config/i386/t-softfp: Add hf soft-fp. * config.host: Add i386/64/t-softf

[PATCH 6/6] AVX512FP16: Support vector init/broadcast/set/extract for FP16.

2021-08-01 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/avx512fp16intrin.h (_mm_set_ph): New intrinsic. (_mm256_set_ph): Likewise. (_mm512_set_ph): Likewise. (_mm_setr_ph): Likewise. (_mm256_setr_ph): Likewise. (_mm512_setr_ph): Likewise. (_mm_set1_ph): Likewise.

[PATCH 5/6] AVX512FP16: Initial support for AVX512FP16 feature and scalar _Float16 instructions.

2021-08-01 Thread liuhongt via Gcc-patches
From: "Guo, Xuepeng" gcc/ChangeLog: * common/config/i386/cpuinfo.h (get_available_features): Detect FEATURE_AVX512FP16. * common/config/i386/i386-common.c (OPTION_MASK_ISA_AVX512FP16_SET, OPTION_MASK_ISA_AVX512FP16_UNSET, OPTION_MASK_ISA2_AVX512FP1

[PATCH] Add cond_add/sub/mul for vector integer modes.

2021-08-02 Thread liuhongt via Gcc-patches
Hi: This is a follow up of [1]. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Pushed to trunk. [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-August/576514.html gcc/ChangeLog: * config/i386/sse.md (cond_): New expander. (cond_mul): Ditto. gcc/testsuite/ChangeLo

[PATCH] [i386] Refine predicate of peephole2 to general_reg_operand. [PR target/101743]

2021-08-03 Thread liuhongt via Gcc-patches
Hi: The define_peephole2 which is added by r12-2640-gf7bf03cf69ccb7dc should only work on general registers, considering that x86 also supports mov instructions between gpr, sse reg, mask reg, limiting the peephole2 predicate to general_reg_operand. I failed to construct a testcase, but I believ

[PATCH] [i386] Support cond_{fma, fms, fnma, fnms} for vector float/double under AVX512.

2021-08-03 Thread liuhongt via Gcc-patches
Hi: This patch adds expanders cond_{fma,fms,fnma,fnms} for vector float/double modes. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: * config/i386/sse.md (cond_fma): New expander. (cond_fms): Ditto. (cond_fnma): Ditto.

[PATCH] Add dg-require-effective-target for testcases.

2021-08-03 Thread liuhongt via Gcc-patches
Hi: Pushed to trunk as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/cond_op_addsubmul_d-2.c: Add dg-require-effective-target for avx512. * gcc.target/i386/cond_op_addsubmul_q-2.c: Ditto. * gcc.target/i386/cond_op_addsubmul_w-2.c: Ditto. * gc

[PATCH 0/3] [i386] Support cond_{smax, smin, umax, umin, xor, ior, and} for vector modes under AVX512

2021-08-04 Thread liuhongt via Gcc-patches
Hi: Together with the previous 3 patches, all cond_op expanders of vector modes are supported (if they have a corresponding avx512 mask instruction). Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. liuhongt (3): [i386] Support cond_{smax,smin,umax,umin} for vector integer modes

[PATCH 1/3] [i386] Support cond_{smax, smin, umax, umin} for vector integer modes under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/sse.md (cond_): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/cond_op_maxmin_b-1.c: New test. * gcc.target/i386/cond_op_maxmin_b-2.c: New test. * gcc.target/i386/cond_op_maxmin_d-1.c: New test. * gcc.target/i386/cond

[PATCH 3/3] [i386] Support cond_{xor, ior, and} for vector integer mode under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/sse.md (cond_): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/cond_op_anylogic_d-1.c: New test. * gcc.target/i386/cond_op_anylogic_d-2.c: New test. * gcc.target/i386/cond_op_anylogic_q-1.c: New test. * gcc.target/i38

[PATCH 2/3] [i386] Support cond_{smax, smin} for vector float/double modes under AVX512.

2021-08-04 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/sse.md (cond_): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/cond_op_maxmin_double-1.c: New test. * gcc.target/i386/cond_op_maxmin_double-2.c: New test. * gcc.target/i386/cond_op_maxmin_float-1.c: New test. * gcc.ta

[PATCH] Make sure we're playing with integral modes before call extract_integral_bit_field.

2021-08-05 Thread liuhongt via Gcc-patches
Hi: --- OK, I think sth is amiss here upthread. insv/extv do look like they are designed to work on integer modes (but docs do not say anything about this here). In fact the caller of extract_bit_field_using_extv is named extract_integral_bit_field. Of course nothing seems to check what kind of m

[PATCH] [rtl-optimization] Simplify vector shift/rotate with const_vec_duplicate to vector shift/rotate with const_int element.

2021-08-06 Thread liuhongt via Gcc-patches
Hi: Bootstrapped and regtested on x86_64-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR rtl-optimization/101796 * simplify-rtx.c (simplify_context::simplify_binary_operation_1): Simplify vector shift/rotate with const_vec_duplicate to vector shift/rot

[PATCH] [i386] Support cond_ashr/lshr/ashl for vector integer modes under AVX512.

2021-08-09 Thread liuhongt via Gcc-patches
Hi: Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. gcc/ChangeLog: * config/i386/sse.md (cond_): New expander. (VI248_AVX512VLBW): New mode iterator. * config/i386/predicates.md (nonimmediate_or_const_vec_dup_operand): New predicate. gcc/testsuite/ChangeLo

[PATCH] Extend ldexp{s, d}f3 to vscalefs{s, d} when TARGET_AVX512F and TARGET_SSE_MATH.

2021-08-10 Thread liuhongt via Gcc-patches
Hi: AVX512F supports vscalefs{s,d}, which is the same as ldexp except that the second operand must be floating point. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. gcc/ChangeLog: PR target/98309 * config/i386/i386.md (ldexp3): Extend to vscalefs[sd] when TARGET_
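The relationship between ldexp and vscalefs{s,d} can be sketched with a scalar model. This assumes the documented behaviour of the instruction, a * 2^floor(b), and ignores overflow, NaN and denormal handling. Both function names are hypothetical and are not part of GCC.

```c
/* Hypothetical scalar model of vscalefsd: a * 2^floor(b).
   Only meant for small exponents; no special-case handling.  */
static double scalef_model(double a, double b)
{
  long e = (long)b;              /* truncates toward zero */
  if ((double)e > b)
    e -= 1;                      /* turn truncation into floor for negative non-integers */
  double r = a;
  for (; e > 0; e--)
    r *= 2.0;
  for (; e < 0; e++)
    r *= 0.5;
  return r;
}

/* ldexp expressed through the model: the integer exponent is first
   converted to floating point, which is the extra step noted above.  */
static double ldexp_via_scalef(double x, int n)
{
  return scalef_model(x, (double)n);
}
```

For example, ldexp_via_scalef(3.0, 4) yields 48.0, matching what ldexp(3.0, 4) returns.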

[PATCH] [i386] Combine avx_vec_concatv16si and avx512f_zero_extendv16hiv16si2_1 to avx512f_zero_extendv16hiv16si2_2.

2021-08-10 Thread liuhongt via Gcc-patches
Hi: Add define_insn_and_split to combine avx_vec_concatv16si/2 and avx512f_zero_extendv16hiv16si2_1, since the latter already zero-extends the upper bits; similarly for other patterns related to pmovzx{bw,wd,dq}. It will do optimization like - vmovdqa %ymm0, %ymm0# 7 [c=4 l=

[PATCH] [i386] Introduce a scalar version of avx512f_vmscalef and adjust ldexp3 for it.

2021-08-11 Thread liuhongt via Gcc-patches
Hi: This is the patch I'm going to check in. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}; 2021-08-12 Uros Bizjak gcc/ChangeLog: PR target/98309 * config/i386/i386.md (avx512f_scalef2): New define_insn. (ldexp3): Adjust for new define_insn.

[PATCH] [i386] Optimize vec_perm_expr to match vpmov{dw,qd,wb}.

2021-08-11 Thread liuhongt via Gcc-patches
Hi: This is another patch to optimize vec_perm_expr to match vpmov{dw,qd,wb} under AVX512. For scenarios (like pr101846-2.c) where the upper half is not used, this patch generates better code with only one vpmov{wb,dw,qd} instruction. For scenarios (like pr101846-3.c) where the upper half is actu
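Why a permutation can be matched to a truncating move is easier to see with a small model. The sketch below assumes a little-endian target such as x86. Selecting the even-indexed 16-bit elements of a vector gives the same result as truncating each 32-bit element. The function names and the element count N are illustrative only.

```c
#include <stdint.h>
#include <string.h>

enum { N = 4 };  /* illustrative element count */

/* View the N 32-bit elements as 2*N 16-bit halves and pick the even ones,
   which is the shape of the permutation being optimized.  */
static void perm_even_halves(const uint32_t src[N], uint16_t out[N])
{
  uint16_t halves[2 * N];
  memcpy(halves, src, sizeof halves);
  for (int i = 0; i < N; i++)
    out[i] = halves[2 * i];
}

/* Element-wise truncation, the operation a vpmovdw-style instruction performs.  */
static void trunc_each(const uint32_t src[N], uint16_t out[N])
{
  for (int i = 0; i < N; i++)
    out[i] = (uint16_t)src[i];
}
```

On a little-endian machine both functions fill out with the same values, which is why the permutation can be replaced by a single truncating move.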

[PATCH] [i386] Optimize __builtin_shuffle_vector.

2021-08-15 Thread liuhongt via Gcc-patches
Hi: Here's the updated patch, which does 3 things: 1. Support vpermw/vpermb in ix86_expand_vec_one_operand_perm_avx512. 2. Support 256/128-bits vpermi2b in ix86_expand_vec_perm_vpermt2. 3. Add define_insn_and_split to optimize specific vector permutation to vpmov{dw,wb,qd}. Bootstrapped and regtes

[PATCH] [i386] Fix ICE.

2021-08-16 Thread liuhongt via Gcc-patches
Hi: avx512f_scalef2 only accepts register_operand for operands[1], so force it to reg in ldexp3. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk. gcc/ChangeLog: PR target/101930 * config/i386/i386.md (ldexp3): Force operands[1] to reg. gcc/testsuite

[PATCH] [i386] Add x86 tune to enable v2df vector reduction by paddpd.

2021-08-17 Thread liuhongt via Gcc-patches
Hi: This patch adds a new x86 tune named X86_TUNE_V2DF_REDUCTION_PREFER_HADDPD to enable haddpd for v2df vector reduction; the tune is disabled by default. Bootstrapped and regtested on x86_64-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR target/97147 * config/i386/i386.h

[PATCH] Revert "Add the member integer_to_sse to processor_cost as a cost simulation for movd/pinsrd. It will be used to calculate the cost of vec_construct."

2021-08-17 Thread liuhongt via Gcc-patches
This reverts commit 872da9a6f664a06d73c987aa0cb2e5b830158a10. PR target/101936 PR target/101929 Bootstrapped and regtested on x86_64-linux-gnu{-m32,} Pushed to master. --- gcc/config/i386/i386.c | 6 +- gcc/config/i386/i386.h | 1 - gcc/config/i386/x8

[PATCH] Disable slp in loop vectorizer when cost model is very-cheap.

2021-08-22 Thread liuhongt via Gcc-patches
Performance impact for the commit with option: -march=x86-64 -O2 -ftree-vectorize -fvect-cost-model=very-cheap

SPEC2017 fprate
503.bwaves_r     BuildSame
507.cactuBSSN_r  -0.04
508.namd_r       0.14
510.parest_r     -0.54
511.povray_r     0.10
519.lbm_r        B

[PATCH] [i386] Fix ICE.

2021-08-23 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: PR target/102016 * config/i386/sse.md (*avx512f_pshufb_truncv8hiv8qi_1): Add TARGET_AVX512BW to condition. gcc/testsuite/ChangeLog: PR target/102016 * gcc.target/i3

[PATCH] [i386] Optimize (a & b) | (c & ~b) to vpternlog instruction.

2021-08-23 Thread liuhongt via Gcc-patches
Also optimize below 3 forms to vpternlog, op1, op2, op3 are register_operand or unary_p as (not reg) A: (any_logic (any_logic op1 op2) op3) B: (any_logic (any_logic op1 op2) (any_logic op3 op4)) op3/op4 should be equal to op1/op2 C: (any_logic (any_logic (any_logic:op1 op2) op3) op4) op3/op4 shoul
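To illustrate how a three-input logic expression such as (a & b) | (c & ~b) corresponds to a single vpternlog, the sketch below derives the 8-bit immediate with the common truth-table trick: evaluate the expression on the constants 0xF0, 0xCC and 0xAA. It also models the instruction bit by bit on one 32-bit lane. The helper names are hypothetical and the code is not taken from the patch.

```c
#include <stdint.h>

/* Truth-table columns for the three vpternlog inputs.  */
enum { TL_A = 0xF0, TL_B = 0xCC, TL_C = 0xAA };

/* Immediate for (a & b) | (c & ~b): evaluate the same expression
   on the truth-table constants and keep the low 8 bits.  */
static uint8_t imm8_for_select(void)
{
  return (uint8_t)((TL_A & TL_B) | (TL_C & ~TL_B));
}

/* Bitwise model of vpternlog on one 32-bit lane: for every bit position
   the three input bits form an index into imm8.  */
static uint32_t ternlog32_model(uint32_t a, uint32_t b, uint32_t c, uint8_t imm8)
{
  uint32_t r = 0;
  for (int i = 0; i < 32; i++) {
    unsigned idx = (((a >> i) & 1u) << 2) | (((b >> i) & 1u) << 1) | ((c >> i) & 1u);
    r |= (uint32_t)((imm8 >> idx) & 1u) << i;
  }
  return r;
}
```

With this convention imm8_for_select() evaluates to 0xE2, and ternlog32_model(a, b, c, 0xE2) agrees with (a & b) | (c & ~b) for any inputs.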

[PATCH] Change illegitimate constant into memref of constant pool in change_zero_ext.

2021-08-24 Thread liuhongt via Gcc-patches
Hi: This patch extends change_zero_ext to change an illegitimate constant into a constant pool reference, which enables simplification of the below: Trying 5 -> 7: 5: r85:V4SF=[`*.LC0'] REG_EQUAL const_vector 7: r84:V4SF=vec_select(vec_concat(r85:V4SF,r85:V4SF),parallel) REG_DEAD r85:V4SF
