[PATCH] Check nonlinear iv in vect_can_advance_ivs_p.

2022-09-28 Thread liuhongt via Gcc-patches
vectorizable_nonlinear_induction doesn't always guard vect_peel_nonlinear_iv_init when it's called by vect_update_ivs_after_vectorizer, which is supposed to be guarded by vect_can_advance_ivs_p. The patch moves part of the code from vectorizable_nonlinear_induction into a new function vect_can_peel_nonlinea

[PATCH] [x86] Fix unrecognizable insn of cvtss2si.

2022-10-09 Thread liuhongt via Gcc-patches
Adjust lrintmn2 operand predicates according to real instructions. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok as an obvious fix? gcc/ChangeLog: PR target/107185 * config/i386/i386.md (lrint2): Swap predicate of operands[0] and operands[1]. gcc/testsuite

[PATCH] [x86] Add define_insn_and_split to support general version of "kxnor".

2022-10-11 Thread liuhongt via Gcc-patches
For general_reg_operand, it will be split into xor + not. For mask_reg_operand, it will be split with UNSPEC_MASK_OP just like what we did for other logic operations. The patch will optimize xor+not to kxnor when possible. Bootstrapped and regtested on x86_64-pc-linux-gnu. Ok for trunk? g
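As a rough illustration of the "xor + not" form mentioned in the snippet above, the following C sketch models XNOR on a 16-bit value both as the two-step sequence and as a bit-by-bit reference. The function names and the 16-bit width are assumptions made for this example; they are not taken from the patch.

```c
#include <stdint.h>

/* XNOR written as xor followed by not: the two-step form the snippet
   says a general-register operand is split into.  */
static uint16_t xnor16_xor_not(uint16_t a, uint16_t b)
{
    uint16_t x = (uint16_t)(a ^ b);   /* xor */
    return (uint16_t)~x;              /* not */
}

/* Reference definition: a result bit is set exactly where a and b agree.  */
static uint16_t xnor16_reference(uint16_t a, uint16_t b)
{
    uint16_t r = 0;
    for (int i = 0; i < 16; i++) {
        unsigned abit = (a >> i) & 1u;
        unsigned bbit = (b >> i) & 1u;
        if (abit == bbit)
            r |= (uint16_t)(1u << i);
    }
    return r;
}
```

Both functions should agree for every input pair, which is the property that lets xor + not be replaced by a single xnor when one is available.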

[PATCH] [i386] Replace ix86_gen_scratch_sse_rtx with gen_reg_rtx.

2022-02-28 Thread liuhongt via Gcc-patches
.. in ix86_expand_vector_move and ix86_convert_const_wide_int_to_broadcast (called by the former). ix86_expand_vector_move is called by emit_move_insn, which is used by many pre_reload passes; ix86_gen_scratch_sse_rtx will break data flow when there's explicit usage of xmm7/xmm15/xmm31. Bootstrapped

[PATCH] [i386] Optimize v4si broadcast for noavx512vl.

2022-03-03 Thread liuhongt via Gcc-patches
This is an incremental patch based on [1]; it enables the optimization below:
- vbroadcastss .LC1(%rip), %xmm0
+ movl $-45, %edx
+ vmovd %edx, %xmm0
+ vpshufd $0, %xmm0, %xmm0
According to microbenchmark, it's faster than broadcast from memory. [1] https://gcc.gnu.org/

[PATCH] [i386] Prevent vectorization for load from parm_decl at O2 to avoid STF issue.

2022-03-03 Thread liuhongt via Gcc-patches
For parameters passed through the stack, a vectorized load from parm_decl in the callee may trigger a serious STF issue. This is why GCC12 regresses 50% for cray at -O2 compared to GCC11. The patch adds an extremely large number to stmt_cost to prevent vectorization for loads from parm_decl under very-cheap co

[PATCH V2] [i386] Optimize v4si broadcast for noavx512vl.

2022-03-06 Thread liuhongt via Gcc-patches
>What happens if you set preferred_for_speed to false for alternative 1? It works, and I've removed the newly added splitter in this patch. Also i tried to do similar things to *vec_dup with mode iterator AVX2_VEC_DUP_MODE, but it hit ICE during reload since x86 don't have direct move for QImode

[PATCH] [i386] Add extra cost for unaligned_load which may have stall forward issue.

2022-03-15 Thread liuhongt via Gcc-patches
This patch only handles pure-slp for by-value passed parameters, which has nothing to do with IPA but with the psABI. For by-reference passed parameters, IPA is required. The patch is aggressive in determining STLF failure: any unaligned_load for a parm_decl passed on the stack is thought to have an STLF stall issue. It

[PATCH] [i386] Don't fold __builtin_ia32_blendvpd w/o sse4.2.

2022-03-16 Thread liuhongt via Gcc-patches
__builtin_ia32_blendvpd is defined under sse4.1 and gimple folded to ((v2di) c) < 0 ? b : a where vec_cmpv2di is under sse4.2, w/o which it's veclowered to scalar operations and not combined back in rtl. Bootstrap and regtest on x86_64-pc-linux-gnu{-m32,}. Ready to push to main trunk. gcc/ChangeLog:
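The folded expression quoted above, ((v2di) c) < 0 ? b : a, can be read lane by lane. The sketch below is a scalar C model of that reading, assuming IEEE binary64 doubles and two lanes; `blendv_pd_model` is a hypothetical name for this illustration, not a GCC or intrinsic function.

```c
#include <stdint.h>
#include <string.h>

/* Per-lane select mirroring "((v2di) c) < 0 ? b : a": reinterpret the
   double mask lane as a signed 64-bit integer and pick b when its sign
   bit is set, otherwise a.  */
static void blendv_pd_model(const double a[2], const double b[2],
                            const double c[2], double out[2])
{
    for (int i = 0; i < 2; i++) {
        int64_t bits;
        memcpy(&bits, &c[i], sizeof bits);   /* bit-cast, no value conversion */
        out[i] = (bits < 0) ? b[i] : a[i];
    }
}
```

Note that the comparison is on the raw bits, so a mask lane of -0.0 selects from b even though -0.0 < 0.0 is false as a floating-point comparison.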

[PATCH] [i386] Add extra cost for unaligned_load which may have stall forward issue.

2022-03-16 Thread liuhongt via Gcc-patches
This patch only handles pure-slp for by-value passed parameters, which has nothing to do with IPA but with the psABI. For by-reference passed parameters, IPA is required. The patch is aggressive in determining STLF failure: any unaligned_load for a parm_decl passed on the stack is thought to have an STLF stall issue. It

[PATCH] [i386] Add extra cost for unaligned_load which may have stall forward issue.

2022-03-16 Thread liuhongt via Gcc-patches
This patch only handles pure-slp for by-value passed parameters, which has nothing to do with IPA but with the psABI. For by-reference passed parameters, IPA is required. The patch is aggressive in determining STLF failure: any unaligned_load for a parm_decl passed on the stack is thought to have an STLF stall issue. It

[PATCH] [avx512fp16] Refine HImode movement for "v" to "v".

2022-03-18 Thread liuhongt via Gcc-patches
Set attr from HImode to HFmode, which uses vmovsh instead of vmovw for movement between sse registers. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for main trunk? gcc/ChangeLog: PR target/104974 * config/i386/i386.md (*movhi_internal): Set attr type from HI

[PATCH] [i386] Extend splitter pattern to reversed condition by swapping then and else rtx. [PR target/104982]

2022-03-21 Thread liuhongt via Gcc-patches
Failed to match this instruction: (set (reg/v:SI 88 [ z ]) (if_then_else:SI (eq (zero_extract:SI (reg:SI 92) (const_int 1 [0x1]) (zero_extend:SI (subreg:QI (reg:SI 93) 0))) (const_int 0 [0])) (reg:SI 95) (reg:SI 94))) but it's equal t

[PATCH] Fix ICE caused by NULL_RTX returned by lowpart_subreg.

2022-03-22 Thread liuhongt via Gcc-patches
In validate_subreg, both (subreg:V2HF (reg:SI) 0) and (subreg:V8HF (reg:V2HF) 0) are valid, but not for (subreg:V8HF (reg:SI) 0) which causes ICE. Ideally it should be handled in validate_subreg to support subreg for all modes available in TARGET_CAN_CHANGE_MODE_CLASS, but that would be too risky

[PATCH] [i386] Fix typo in vec_setv8hi_0.

2022-03-27 Thread liuhongt via Gcc-patches
pinsrw is available for both reg and mem operand under sse2. pextrw requires sse4.1 for mem operands. The patch changes attr "isa" for the pinsrw mem alternative from sse4_noavx to noavx, which enables the optimization below.
- movzwl (%rdi), %eax
  pxor %xmm1, %xmm1
- pinsrw $0, %

[PATCH] Split vector load from parm_decl to elemental loads to avoid STLF stalls.

2022-03-30 Thread liuhongt via Gcc-patches
Since cfg is freed before machine_reorg, just do a rough calculation of the window according to the layout. Also according to an experiment on CLX, set window size to 64. Currently only handle V2DFmode load since it doesn't need any scratch registers, and it's sufficient to recover cray performanc

[PATCH] Split vector load from parm_decl to elemental loads to avoid STLF stalls.

2022-03-31 Thread liuhongt via Gcc-patches
Update in V2: 1. Use get_insns instead of FOR_EACH_BB_CFUN and FOR_BB_INSNS. 2. Return for any_uncondjump_p and ANY_RETURN_P. 3. Add dump info for splitting instruction. 4. Restrict ix86_split_stlf_stall_load under TARGET_SSE2. Since cfg is freed before machine_reorg, just do a rough calculation of

[PATCH V3] Split vector load from parm_decl to elemental loads to avoid STLF stalls.

2022-04-01 Thread liuhongt via Gcc-patches
Update in V3: 1. Add -param=x86-stlf-window-ninsns= (default 64). 2. Exclude call in the window. Since cfg is freed before machine_reorg, just do a rough calculation of the window according to the layout. Also according to an experiment on CLX, set window size to 64. Currently only handle V2DFmod

[PATCH] Refine and/ior/xor/andn masked patterns for V*HFmode.

2022-04-05 Thread liuhongt via Gcc-patches
There's no masked vpandw or vpandb, similar for vpxor/vpor/vpandn. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ready to push to trunk. gcc/ChangeLog: * config/i386/sse.md (_andnot3_mask): Removed. (_andnot3): Disable V*HFmode patterns for mask_applied

[PATCH] Enhance vec_pack_trunc for integral mode mask.

2022-01-18 Thread liuhongt via Gcc-patches
> your description above hints at that the actual modes involved in the > vec_pack_sbool_trunc are the same so the TYPE_MODE (narrow_vectype) > and TYPE_MODE (vectype) are not the actual modes participating. I think > it would be way better to fix that. > > I suppose that since we know TYPE_VECTOR

[PATCH] [vect] Add vect_recog_cond_expr_convert_pattern.

2022-01-24 Thread liuhongt via Gcc-patches
The pattern converts (cond (cmp a b) (convert c) (convert d)) to (convert (cond (cmp a b) c d)) when 1) types_match (c, d) 2) single_use for (convert c) and (convert d) 3) TYPE_PRECISION (TREE_TYPE (c)) == TYPE_PRECISION (TREE_TYPE (a)) 4) INTEGRAL_TYPE_P (TREE_TYPE (c)) The pattern can save pack
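To make the rewrite concrete, here is a scalar C sketch of the two shapes, assuming the comparison is a < b, c and d are 32-bit integers of the same type, and the conversion is a widening to 64 bits. The function names are hypothetical and only illustrate why the two forms give the same result; the vectorizer pattern itself works on GIMPLE, not on code like this.

```c
#include <stdint.h>

/* Original shape: (cond (cmp a b) (convert c) (convert d)).
   Both arms are widened before the select.  */
static int64_t widen_then_select(int32_t a, int32_t b, int32_t c, int32_t d)
{
    return (a < b) ? (int64_t)c : (int64_t)d;
}

/* Rewritten shape: (convert (cond (cmp a b) c d)).
   The select happens in the narrow type, followed by one conversion.  */
static int64_t select_then_widen(int32_t a, int32_t b, int32_t c, int32_t d)
{
    int32_t narrow = (a < b) ? c : d;
    return (int64_t)narrow;
}
```

In the rewritten shape the comparison and the select operate on values of the same precision, which matches condition 3) in the snippet.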

[PATCH] [rtl/cprop_hardreg] Don't propagate for a more expensive reg-reg move.

2022-01-24 Thread liuhongt via Gcc-patches
For i386, it enables optimization like:
  vmovd %xmm0, %edx
- vmovd %xmm0, %eax
+ movl %edx, %eax
Bootstrapped and regtested on CLX for both x86_64-pc-linux-gnu{-m32,} and x86_64-pc-linux-gnu{-m32\ -march=native,\ -march=native} Ok for trunk? gcc/ChangeLog: PR

[PATCH] [i386] ICE: QImode(not SImode) operand should be passed to gen_vec_initv16qiqi in ashlv16qi3.

2022-02-09 Thread liuhongt via Gcc-patches
ix86_expand_vector_init expects vals to be a parallel containing values of individual fields which should be either element mode of the vector mode, or a vector mode with the same element mode and smaller number of elements. But in the expander ashlv16qi3, the second operand is SImode which can't

[PATCH] [i386] ICE: QImode(not SImode) operand should be passed to gen_vec_initv16qiqi in ashlv16qi3.

2022-02-09 Thread liuhongt via Gcc-patches
ix86_expand_vector_init expects vals to be a parallel containing values of individual fields which should be either element mode of the vector mode, or a vector mode with the same element mode and smaller number of elements. But in the expander ashlv16qi3, the second operand is SImode which can't

[PATCH] [vect] Add vect_recog_cond_expr_convert_pattern.

2022-02-09 Thread liuhongt via Gcc-patches
>But in principle @2 or @3 could safely differ in sign, you'd then need to >ensure >to insert sign conversions to @2/@3 to the signedness of @4/@5. Changed. >you are not testing for this anywhere? It's tested in vect_recog_cond_expr_convert_pattern, I've move it to match.pd >Btw, matching up the

[PATCH] Add single_use to simplification (uncond_op + vec_cond -> cond_op).

2022-02-10 Thread liuhongt via Gcc-patches
>>> Confirmed. When uncond_op is expensive (there's *div amongst them) that's >>> definitely unwanted. OTOH when it is cheap then combining will reduce >>> latency. >>> >>> GIMPLE wise it's a neutral transform if uncond_op is not single-use unless >>> we need two v_c_es. >> >> We can leave it t

[PATCH] Restrict the two sources of vect_recog_cond_expr_convert_pattern to be of the same type when convert is extension.

2022-02-16 Thread liuhongt via Gcc-patches
> > +(match (cond_expr_convert_p @0 @2 @3 @6) > > + (cond (simple_comparison@6 @0 @1) (convert@4 @2) (convert@5 @3)) > > + (if (types_match (TREE_TYPE (@2), TREE_TYPE (@3)) > > But in principle @2 or @3 could safely differ in sign, you'd then need to > ensure > to insert sign conversions to @2/@3

[PATCH] [i386] Clean up MPX-related bit_{MPX,BNDREGS,BNDCSR}.

2022-02-16 Thread liuhongt via Gcc-patches
Bootstrap and regtest on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/cpuid.h (bit_MPX): Removed. (bit_BNDREGS): Ditto. (bit_BNDCSR): Ditto. --- gcc/config/i386/cpuid.h | 5 - 1 file changed, 5 deletions(-) diff --git a/gcc/config/i386/cp

[PATCH V2] Restrict the two sources of vect_recog_cond_expr_convert_pattern to be of the same type when convert is extension.

2022-02-16 Thread liuhongt via Gcc-patches
> I find this quite unreadable, it looks like if @2 and @3 are treated > differently. I think keeping the old 3 lines and just adding > && (TYPE_PRECISION (TREE_TYPE (@0)) >= TYPE_PRECISION (type) > || (TYPE_UNSIGNED (TREE_TYPE (@2)) > == TYPE_UNSIGNED (TREE_TYPE (@3)

[PATCH] [i386] Fix typo in v1ti3.

2022-02-23 Thread liuhongt via Gcc-patches
For evex encoding vp{xor,or,and}, a suffix is needed; otherwise there would be an error for vpxor %ymm0, %ymm31, %ymm1 Error: unsupported instruction `vpxor' Bootstrapped and regtested x86_64-pc-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: * config/i386/sse.md (v1ti3): Add suffix and repla

[PATCH] [i386] Don't fold builtin into gimple when isa mismatches.

2022-02-24 Thread liuhongt via Gcc-patches
The patch fixes ICE in ix86_gimple_fold_builtin. gcc/ChangeLog: PR target/104666 * config/i386/i386-expand.cc (ix86_check_builtin_isa_match): New func. (ix86_expand_builtin): Move code to ix86_check_builtin_isa_match and call it. * config/i386/i386-

[PATCH] Reduce cost of aligned sse register store.

2021-11-17 Thread liuhongt via Gcc-patches
Make them equal to the cost of unaligned ones to avoid odd alignment peeling. Impact for SPEC2017 on CLX:
fprate:
  503.bwaves_r     BuildSame
  507.cactuBSSN_r  -0.22
  508.namd_r       -0.02
  510.parest_r     -0.28
  511.povray_r     -0.20
  519.lbm_r        BuildSame
  521.wrf_r

[PATCH] Don't allow mask/sse/mmx mov in TLS code sequences.

2021-11-17 Thread liuhongt via Gcc-patches
Following a change in the assembler (see [1]), this patch disallows mask/sse/mmx mov in TLS code sequences, which require integer MOV instructions. [1] https://sourceware.org/git/?p=binutils-gdb.git;a=patch;h=d7e3e627027fcf37d63e284144fe27ff4eba36b5 Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}.

[PATCH] Don't allow mask/sse/mmx mov in TLS code sequences.

2021-11-18 Thread liuhongt via Gcc-patches
>Why is the above declared as a special memory constraint? Also the Change to define_memory_constraint since it's ok for reload can make them match by converting the operand to the form ‘(mem (reg X))’.where X is a base register (from the register class specified by BASE_REG_CLASS >predicate comme

[PATCH] Fix typo in r12-5486.

2021-11-24 Thread liuhongt via Gcc-patches
TYPE_PRECISION (type) < TYPE_PRECISION (TREE_TYPE (@2)) is supposed to check the integer type, not the pointer type, so use the second parameter instead, i.e. the first parameter is VPTR and the second parameter is I4. DEF_SYNC_BUILTIN (BUILT_IN_ATOMIC_FETCH_OR_4, "__atomic_fetch_or_4",

[PATCH] Fix regression introduced by r12-5536.

2021-11-28 Thread liuhongt via Gcc-patches
There're several failures reported in [1]: 1. unsupported instruction `pextrw` for "pextrw $0, %xmm31, 16(%rax)" %vpextrw should be used in output templates. 2. ICE in get_attr_memory for movhi_internal since some alternatives are marked as TYPE_SSELOG. Explicitly set memory_attr for those alterna

[PATCH] Optimize _Float16 usage for non AVX512FP16.

2021-11-28 Thread liuhongt via Gcc-patches
As discussed in the PR, this patch does the following optimizations: 1. No memory is needed to move HI/HFmode between GPR and SSE registers under TARGET_SSE2 and above; pinsrw/pextrw are used for them w/o AVX512FP16. 2. Use gen_sse2_pinsrph/gen_vec_setv4sf_0 to replace ix86_expand_vector_set in extendhfsf2/truncsfhf2

[PATCH] [i386] Fix ICE in ix86_attr_length_immediate_default.

2021-11-30 Thread liuhongt via Gcc-patches
ix86_attr_length_immediate_default assumes TYPE ishift has only 1 constant operand, but *x86_64_shld_1/*x86_shld_1/*x86_64_shrd_1/*x86_shrd_1 have 2, with condition: INTVAL (operands[3]) == 32 - INTVAL (operands[2]) or INTVAL (operands[3]) == 64 - INTVAL (operands[2]), and hit gcc_assert. Explicitly
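The condition INTVAL (operands[3]) == 32 - INTVAL (operands[2]) reflects how a double-precision left shift combines two shifts whose counts sum to the operand width. The C sketch below models a 32-bit shld under that reading; `shld32` is a hypothetical helper for illustration, not code from the patch, and it masks the count to 5 bits as the x86 instruction does for 32-bit operands.

```c
#include <stdint.h>

/* Model of a 32-bit shld: shift dst left by count and fill the vacated
   low bits from the high bits of src.  The two shift amounts, count and
   32 - count, correspond to the two constants in the quoted condition.  */
static uint32_t shld32(uint32_t dst, uint32_t src, unsigned count)
{
    count &= 31;              /* count is taken modulo 32 */
    if (count == 0)
        return dst;           /* avoid an undefined shift by 32 below */
    return (dst << count) | (src >> (32 - count));
}
```

Because the second shift amount is fully determined by the first, the RTL pattern carries two constant operands even though the instruction encodes only one immediate.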

[PATCH] [i386] Prefer INT_SSE_REGS for SSE_FLOAT_MODE_P in preferred_reload_class.

2021-12-02 Thread liuhongt via Gcc-patches
The patch helps reload to choose GENERAL_REGS alternatives for SSE_FLOAT_MODE and enables optimizations like
- vmovd %xmm0, -4(%rsp)
- movl $1, %eax
- addl -4(%rsp), %eax
+ movd %xmm0, %eax
+ addl $1, %eax
Bootstrapped and regtested on x86_64-pc-linux

[PATCH] [i386] Prefer INT_SSE_REGS for SSE_FLOAT_MODE_P in preferred_reload_class.

2021-12-02 Thread liuhongt via Gcc-patches
Hi: > Please also consider TARGET_INTER_UNIT_MOVES_TO_VEC and > TARGET_INTER_UNIT_MOVES_FROM_VEC. Here's the updated patch. Also honor TARGET_INTER_UNIT_MOVES_TO/FROM_VEC in preferred_{,out_}reload_class. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32\ -march=k8,\ -march=k8}. Ok? gcc/Cha

[PATCH] [i386] Prefer INT_SSE_REGS for SSE_FLOAT_MODE_P in preferred_reload_class.

2021-12-05 Thread liuhongt via Gcc-patches
When moves between integer and sse registers are cheap. 2021-12-06 Hongtao Liu Uroš Bizjak gcc/ChangeLog: PR target/95740 * config/i386/i386.c (ix86_preferred_reload_class): Allow integer regs when moves between register units are cheap. * config/i

[PATCH] Canonicalize vec_perm index to make the first index come from the first vector.

2022-10-18 Thread liuhongt via Gcc-patches
Fix unexpected non-canon form from gimple vector selector. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/107271 * config/i386/i386-expand.cc (ix86_vec_perm_index_canon): New. (expand_vec_perm_shufps_shufps): Call
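One way to picture the canonicalization named in the subject is the following C sketch: if the first selector index refers to the second input, every index is rotated by the element count and the two inputs are swapped, so the selected elements stay the same. The struct and function names here are hypothetical and simplified (a fixed capacity of 8 lanes, plain int elements); this is an illustration of the idea, not the body of ix86_vec_perm_index_canon.

```c
/* Hypothetical model of a two-input permutation: indices 0..nelt-1 pick
   from op0, indices nelt..2*nelt-1 pick from op1.  */
struct perm_model {
    const int *op0;
    const int *op1;
    unsigned perm[8];
    unsigned nelt;      /* number of lanes, at most 8 in this sketch */
};

/* If the first index selects from op1, remap all indices and swap the
   inputs so that the first index selects from op0 instead.  */
static void perm_canon(struct perm_model *d)
{
    if (d->perm[0] < d->nelt)
        return;
    for (unsigned i = 0; i < d->nelt; i++)
        d->perm[i] = (d->perm[i] + d->nelt) % (2 * d->nelt);
    const int *tmp = d->op0;
    d->op0 = d->op1;
    d->op1 = tmp;
}

/* Element selected for output lane i.  */
static int perm_lane(const struct perm_model *d, unsigned i)
{
    unsigned idx = d->perm[i];
    return idx < d->nelt ? d->op0[idx] : d->op1[idx - d->nelt];
}
```

For example, with four lanes and selector { 5, 0, 6, 1 }, canonicalization yields { 1, 4, 2, 5 } with the inputs swapped, and each output lane reads the same element as before.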

[PATCH] [x86] Enable V4BFmode and V2BFmode.

2022-10-25 Thread liuhongt via Gcc-patches
Enable V4BFmode and V2BFmode with the same ABI as V4HFmode and V2HFmode. No real operation is supported for them except for movement. This should solve PR target/107261. Also I notice there's redundancy in VALID_AVX512FP16_REG_MODE, and remove V2BFmode remove it. Bootstrapped and regtested on x86

[PATCH] [x86] Fix incorrect digit constraint

2022-10-27 Thread liuhongt via Gcc-patches
Matching constraints are used in these circumstances. More precisely, the two operands that match must include one input-only operand and one output-only operand. Moreover, the digit must be a smaller number than the number of the operand that uses it in the constraint. In pr107057, the 2 operands

[PATCH V2] [x86] Fix incorrect digit constraint

2022-10-30 Thread liuhongt via Gcc-patches
>You have a couple of other patterns where operand 1 is matched to >produce vmovddup insn. These are *avx512f_unpcklpd512 and >avx_unpcklpd256. You can also remove expander in both >cases. Yes, changed in V2 patch. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? Matching

[PATCH] Enable more optimization for 32-bit/64-bit shrd/shld with imm shift count.

2022-10-30 Thread liuhongt via Gcc-patches
This patch doesn't handle variable count since it requires 5 insns to be combined to get the wanted pattern, but current pass_combine only supports at most 4. This patch doesn't handle 16-bit shrd/shld either. Ideally, we can avoid redundancy of *x86_64_shld_shrd_1_nozext/*x86_shld_shrd_1_nozext if mid

[PATCH] Don't gimple fold ymm-version vblendvpd/vblendvps/vpblendvb w/o TARGET_AVX2

2022-08-23 Thread liuhongt via Gcc-patches
256-bit vector integer comparison is under TARGET_AVX2, and gimple folding for vblendvpd/vblendvps/vpblendvb relies on it, so restrict the gimple fold condition to TARGET_AVX2. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/106704

[PATCH V2] Extend vectorizer to handle nonlinear induction for neg, mul/lshift/rshift with a constant.

2022-08-28 Thread liuhongt via Gcc-patches
>Looks good overall - a few comments inline. Also can you please add >SLP support? >I've tried hard to fill in gaps where SLP support is missing since my >goal is still to get >rid of non-SLP. For slp with different induction type, they need separate iv update and an vector permutation. And if the

[PATCH] Fix _mm512_cvt_roundps_ph to generate sae instruction.

2022-09-04 Thread liuhongt via Gcc-patches
zmm-version vcvtps2ph is special: it encodes {sae} in evex, but puts round control in the imm. For intrinsic _mm512_cvt_roundps_ph (a, imm), imm contains both {sae} and round control; we need to separate them in the assembly output since vcvtps2ph will ignore imm[3:7]. Corresponding llvm patch. Intri

[PATCH] Strip off a vector load which is only used partially.

2022-05-04 Thread liuhongt via Gcc-patches
Optimize
  _1 = *srcp_3(D);
  _4 = VEC_PERM_EXPR <_1, _1, { 4, 5, 6, 7, 4, 5, 6, 7 }>;
  _5 = BIT_FIELD_REF <_4, 128, 0>;
to
  _1 = *srcp_3(D);
  _5 = BIT_FIELD_REF <_1, 128, 128>;
the above will finally be optimized to
  _5 = BIT_FIELD_REF <*srcp_3(D), 128, 128>;
Bootstrapped and regtested on
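The equivalence behind this rewrite can be checked with a small array model. The sketch below assumes an 8-lane vector of 32-bit elements (256 bits total), which is one layout consistent with the 8-entry selector and the 128-bit extracts but is not stated in the snippet; the function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

enum { NELT = 8, HALF = 4 };

/* Permute with selector { 4, 5, 6, 7, 4, 5, 6, 7 }, then take the low
   128 bits (lanes 0..3) of the permuted vector.  */
static void via_permute(const int32_t src[NELT], int32_t out[HALF])
{
    static const unsigned sel[NELT] = { 4, 5, 6, 7, 4, 5, 6, 7 };
    int32_t permuted[NELT];
    for (int i = 0; i < NELT; i++)
        permuted[i] = src[sel[i]];   /* both permute inputs are the same vector */
    memcpy(out, permuted, HALF * sizeof(int32_t));   /* bits [0, 128) */
}

/* Take the high 128 bits (lanes 4..7) of the source directly.  */
static void via_direct_extract(const int32_t src[NELT], int32_t out[HALF])
{
    memcpy(out, src + HALF, HALF * sizeof(int32_t)); /* bits [128, 256) */
}
```

Both routines produce the same four elements, so the permute is unnecessary and only the upper half of the load is actually needed.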

[PATCH] Expand __builtin_memcmp_eq with ptest for OI/TImode.

2022-05-05 Thread liuhongt via Gcc-patches
Enable optimization for TImode only under 32-bit target; for 64-bit target there could be extra integer <-> sse moves due to the psABI, which is not efficient. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR target/104610 * config/i386/i386-expand.c

[PATCH] Expand __builtin_memcmp_eq with ptest for OImode.

2022-05-06 Thread liuhongt via Gcc-patches
This is the adjusted patch only for OImode. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/104610 * config/i386/i386-expand.cc (ix86_expand_branch): Use ptest for QImode when code is EQ or NE. * config/i386/sse.md (cbr

[PATCH] [i386] Optimize movzwl + vmovd/vmovq to vmovw.

2022-05-08 Thread liuhongt via Gcc-patches
Similarly optimize movl + vmovq to vmovd. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/104915 * config/i386/sse.md (*vec_set_0_zero_extendhi): New pre_reload define_insn_and_split. (*vec_setv2di_0_zero_extendhi_1

[PATCH v2] Strip off a vector load which is only used partially.

2022-05-08 Thread liuhongt via Gcc-patches
Here's the adjusted patch. Ok for trunk? Optimize
  _4 = VEC_PERM_EXPR <_1, _1, { 4, 5, 6, 7, 4, 5, 6, 7 }>;
  _5 = BIT_FIELD_REF <_4, 128, 0>;
to
  _5 = BIT_FIELD_REF <_1, 128, 128>;
gcc/ChangeLog: PR tree-optimization/102583 * tree-ssa-forwprop.cc (simplify_bitfield_ref): Extende

[PATCH] [Middle-end] Enhance final_value_replacement_loop to handle bitwise induction.

2022-05-08 Thread liuhongt via Gcc-patches
This patch will enable the optimization below:
 {
-  int bit;
-  long long unsigned int _1;
-  long long unsigned int _2;
-  [local count: 46707768]:
-
-  [local count: 1027034057]:
-  # tmp_11 = PHI
-  # bit_13 = PHI
-  _1 = 1 << bit_13;
-  _2 = ~_1;
-  tmp_8 = _2 & tmp_11;
-  bit_9 = bit_13 +

[PATCH] [i386] Implement permutation with pslldq + psrldq + por when pshufb is not available.

2022-05-08 Thread liuhongt via Gcc-patches
pand/pandn may be used to clear upper/lower bits of the operands; in that case there will be 4-5 instructions for permutation, and it's still better than scalar code. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/105354 * confi

[PATCH] Optimize vec_setv8{hi,hf}_0 + pmovzxbq to pmovzxbq.

2022-05-08 Thread liuhongt via Gcc-patches
Clean up of 16-bit uppers is not needed for pmovzxbq/pmovsxbq. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/105072 * config/i386/sse.md (*sse4_1_v2qiv2di2_1): New define_insn. (*sse4_1_zero_extendv2qiv2di2_2): Ne

[PATCH] Optimize vpermtiw/b to vpunpcklqdq for certain cases.

2022-05-13 Thread liuhongt via Gcc-patches
Assembly optimization like:
- vmovq %xmm0, %xmm2
- vmovdqa .LC0(%rip), %xmm0
  vmovq %xmm1, %xmm1
- vpermi2w %xmm1, %xmm2, %xmm0
+ vmovq %xmm0, %xmm0
+ vpunpcklqdq %xmm1, %xmm0, %xmm0
...
-.LC0:
- .value 0
- .value 1
- .valu

[PATCH v2] Optimize vpermtiw/b to vpunpcklqdq for certain cases.

2022-05-13 Thread liuhongt via Gcc-patches
Here's the updated patch, which adds ix86_pre_reload_split () to those 2 define_insn_and_splits. Assembly optimization like:
- vmovq %xmm0, %xmm2
- vmovdqa .LC0(%rip), %xmm0
  vmovq %xmm1, %xmm1
- vpermi2w %xmm1, %xmm2, %xmm0
+ vmovq %xmm0, %xmm0
+ vpun

[PATCH] [i386] Fix ICE caused by wrong condition.

2022-05-13 Thread liuhongt via Gcc-patches
When d->perm[i] == d->perm[i-1] + 1 and d->perm[i] == nelt, it's not continuous. It should fail if there are more than 2 continuous areas. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR target/105587 * config/i386/i386-expand.cc (
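The rule in this snippet can be sketched as a run counter over the selector: consecutive indices extend the current area unless the run steps onto index nelt, which is the first element of the second input and therefore starts a new area. The function below is a hypothetical illustration of that rule under the usual convention that indices 0..nelt-1 refer to the first input; it is not the code from i386-expand.cc.

```c
/* Count "continuous areas" in a permutation selector.  A new area starts
   whenever an index is not its predecessor plus one, or when it equals
   nelt (stepping from the first input into the second).  */
static unsigned count_continuous_areas(const unsigned *perm, unsigned nelt)
{
    if (nelt == 0)
        return 0;
    unsigned areas = 1;
    for (unsigned i = 1; i < nelt; i++) {
        int consecutive = perm[i] == perm[i - 1] + 1;
        int crosses_inputs = perm[i] == nelt;
        if (!consecutive || crosses_inputs)
            areas++;
    }
    return areas;
}
```

With nelt = 4, the selector { 2, 3, 4, 5 } counts as two areas even though the indices are numerically consecutive, and a caller following the snippet's rule would reject any selector for which this count exceeds 2.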

[PATCH] Clamp vec_perm_expr index in simplify_bitfield_ref to avoid ICE.

2022-05-16 Thread liuhongt via Gcc-patches
Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR tree-optimization/105591 * tree-ssa-forwprop.cc (simplify_bitfield_ref): Clamp vec_perm_expr index. gcc/testsuite/ChangeLog: * gcc.dg/pr105591.c: New test. --- gcc/testsuite

[PATCH] [i386] recognize bzhi pattern when there's zero_extendsidi.

2022-05-16 Thread liuhongt via Gcc-patches
The backend has
(define_insn "*bmi2_bzhi_3_2"
  [(set (match_operand:SWI48 0 "register_operand" "=r")
        (and:SWI48
          (plus:SWI48
            (ashift:SWI48 (const_int 1)
                          (match_operand:QI 2 "register_operand" "r"))
            (

[PATCH] Increase move cost between mask and gpr.

2022-05-19 Thread liuhongt via Gcc-patches
kmovd only uses port5, which is often the bottleneck of performance. Also, from a latency perspective, spill and reload mostly could be STLF or even MRN, which only take 1 cycle. So the patch increases the move cost between gpr and mask to be the same as gpr <-> sse register. Bootstrapped and regtested on

[PATCH] Add a bit dislike for separate mem alternative when op is REG_P.

2022-05-24 Thread liuhongt via Gcc-patches
Right now, mem_cost for a separate mem alternative is 1 * frequency, which is pretty small and caused the unnecessary SSE spill in the PR. I've tried to rework the backend cost model, but RA is still not happy with that (it regresses somewhere else). I think the root cause of this is cost for separate 'm' alternati

[PATCH] Remove macro check for __AMX_BF16/INT8/TILE__ in header file.

2021-09-01 Thread liuhongt via Gcc-patches
Hi: Details discussed in PR. Bootstrapped and regtested on x86-64_linux-gnu{-m32,}. Pushed to master and GCC-11. gcc/ChangeLog: PR target/102166 * config/i386/amxbf16intrin.h : Remove macro check for __AMX_BF16__. * config/i386/amxint8intrin.h : Remove macro check fo

[PATCH] Explicitly add -msse2 to compile HF related libgcc source file.

2021-09-03 Thread liuhongt via Gcc-patches
For 32-bit libgcc configured w/o sse2, there would be an error since GCC only supports _Float16 under sse2. Explicitly add -msse2 for those HF related libgcc functions, so users can still link them w/ the upper configuration. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk?

[PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-06 Thread liuhongt via Gcc-patches
Hi: As discussed in [1], most of (currently unopposed) targets want auto-vectorization at O2, and IMHO now would be a good time to enable O2 vectorization for GCC trunk, so it would leave enough time to expose related issues and fix them. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}

[PATCH] Adjust the wording for x86 _Float16 type.

2021-09-06 Thread liuhongt via Gcc-patches
Hi: As discussed in [1], adjust the layout for x86 _Float16 description. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * doc/extend.texi: (@node Floating Types): Adjust the wording. (@node Half-Precision): Ditto. --- gcc/doc/extend.te

[PATCH] Avoid FROM being overwritten in expand_fix.

2021-09-06 Thread liuhongt via Gcc-patches
Hi: For the conversion from _Float16 to int, if the corresponding optab does not exist, the compiler will try the wider mode (SFmode here), but when floatsfsi exists but FAIL, FROM will be rewritten, which leads to a PR runtime error. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok

[PATCH] [i386] Optimize v4sf reduction.

2021-09-07 Thread liuhongt via Gcc-patches
Hi: The optimization is described in the PR. The two instruction sequences are almost as fast, but the optimized instruction sequence could be one mov instruction shorter on sse2 and 2 mov instructions shorter on sse3. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. gcc/ChangeLog: PR

[PATCH] Optimize vec_extract for 256/512-bit vector when index exceeds the lower 128 bits.

2021-09-08 Thread liuhongt via Gcc-patches
Hi: As described in the PR, valign{d,q} can be used to extract one element from a vector. For elements located in the lower 128 bits, only one instruction is needed, so this patch only optimizes elements located above 128 bits. The optimization is like:
- vextracti32x8 $0x1, %zmm0, %ymm0
- v

[PATCH] [i386] Remove copysign post_reload splitter for scalar modes.

2021-09-09 Thread liuhongt via Gcc-patches
Hi: As a follow-up of [1], the patch removes all scalar mode copysign related post_reload splitters/define_insns and expands copysign directly into the sequence below using paradoxical subregs.
  op3 = op1 & ~mask;
  op4 = op2 & mask;
  dest = op3 | op4;
It can sometimes generate better code just like avx512dq
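The three-step mask sequence in this snippet can be tried out in plain C. The sketch below assumes float is IEEE binary32, so the sign lives in bit 31; the variable names op1 to op4, mask and dest follow the snippet, while `copysign_bits` is a hypothetical wrapper for illustration.

```c
#include <stdint.h>
#include <string.h>

/* copysign via bitwise operations: keep everything but the sign bit of
   the first operand and take the sign bit from the second.  */
static float copysign_bits(float magnitude_src, float sign_src)
{
    const uint32_t mask = UINT32_C(0x80000000);   /* sign bit of binary32 */
    uint32_t op1, op2;
    memcpy(&op1, &magnitude_src, sizeof op1);     /* bit-cast float -> integer */
    memcpy(&op2, &sign_src, sizeof op2);

    uint32_t op3 = op1 & ~mask;                   /* op3 = op1 & ~mask */
    uint32_t op4 = op2 & mask;                    /* op4 = op2 & mask  */
    uint32_t dest = op3 | op4;                    /* dest = op3 | op4  */

    float result;
    memcpy(&result, &dest, sizeof result);        /* bit-cast back to float */
    return result;
}
```

For instance, copysign_bits(3.0f, -1.0f) yields -3.0f, matching the behaviour of the standard copysignf.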

[PATCH] Relax condition of (vec_concat:M(vec_select op0 idx0)(vec_select op0 idx1)) to allow different modes between op0 and M, but have same inner mode.

2021-09-09 Thread liuhongt via Gcc-patches
Currently for (vec_concat:M (vec_select op0 idx1)(vec_select op0 idx2)), the optimizer wouldn't simplify if op0 has a different mode from M, but that's too restrictive and prevents the optimization below; the condition can be relaxed so that op0 must have the same inner mode as M. (set (reg:V2DF 87 [ xx ])

[PATCH] Disallow paradoxical subregs when outer mode is SCALAR_FLOAT_MODE_P.

2021-09-09 Thread liuhongt via Gcc-patches
Hi: In general_operand, paradoxical subregs w/ outermode SCALAR_FLOAT_MODE_P are not allowed unless lra_in_progress, so this patch adds the restriction to validate_subreg as well. Bootstrapped and regtested on x86_64-linux-gnu{-m32,} Also the newly added tests are compiled with aarch64-linu

[PATCH 0/2] Revert r12-3277 since it caused regressions on many other targets.

2021-09-10 Thread liuhongt via Gcc-patches
Hi: Details discussed in https://gcc.gnu.org/pipermail/gcc-patches/2021-September/579170.html. Bootstrapped and regtested on x86_64-linux-gnu{-m32,}. Ok for trunk? liuhongt (2): Revert "Get rid of all float-int special cases in validate_subreg." validate_subreg b

[PATCH 1/2] Revert "Get rid of all float-int special cases in validate_subreg."

2021-09-10 Thread liuhongt via Gcc-patches
This reverts commit d2874d905647a1d146dafa60199d440e837adc4d. PR target/102254 PR target/102154 PR target/102211 --- gcc/emit-rtl.c | 40 1 file changed, 40 insertions(+) diff --git a/gcc/emit-rtl.c b/gcc/emit-rtl.c index 77ea8948ee8..ff3b4449b37 100644 -

[PATCH 2/2] validate_subreg before call gen_lowpart to avoid ICE.

2021-09-10 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * expmed.c (extract_bit_field_using_extv): validate_subreg before call gen_lowpart. --- gcc/expmed.c | 6 +- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/gcc/expmed.c b/gcc/expmed.c index 3143f38e057..10d62d857a8 100644 --- a/gcc/expmed.c +++ b/g

[PATCH] Remove UNSPEC_{COPYSIGN,XORSIGN}.

2021-09-12 Thread liuhongt via Gcc-patches
Hi: UNSPEC_COPYSIGN/XORSIGN are only used by related post_reload splitters which have been removed by r12-3417 and r12-3435. Bootstrapped and regtest on x86_64-linux-gnu{-m32,}. Pushed to trunk. gcc/ChangeLog: * config/i386/i386.md: (UNSPEC_COPYSIGN): Remove. (UNSPEC_XORSI

[PATCH] Output vextract{i, f}{32x4, 64x2} for (vec_select:(reg:Vmode) idx) when byte_offset of idx % 16 == 0.

2021-09-14 Thread liuhongt via Gcc-patches
Hi: As described in the PR, use vextract instead of valign when byte_offset % 16 == 0. Bootstrapped and regtest on x86_64-linux-gnu{-m32,}. Pushed to trunk. 2020-09-13 Hongtao Liu Peter Cordes gcc/ChangeLog: PR target/91103 * config/i386/sse.md (extract_suf):

[PATCH] Optimize for V{8,16,32}HFmode vec_set/extract/init.

2021-09-15 Thread liuhongt via Gcc-patches
Hi: The optimization is described in the PR. Bootstrapped and regtest on x86_64-linux-gnu{-m32,}. All avx512fp16 runtest cases passed on SPR. gcc/ChangeLog: PR target/102327 * config/i386/i386-expand.c (ix86_expand_vector_init_interleave): Use punpcklwd to pack 2

[PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-15 Thread liuhongt via Gcc-patches
Ping rebased on latest trunk. gcc/ChangeLog: * common.opt (ftree-vectorize): Add Var(flag_tree_vectorize). * doc/invoke.texi (Options That Control Optimization): Update documents. * opts.c (default_options_table): Enable auto-vectorization at O2 with very-c

[PATCH] Check mask type when doing cond_op related gimple simplification.

2021-09-15 Thread liuhongt via Gcc-patches
Ping. Bootstrapped and regtest on x86_64-linux-gnu{-m32,}, aarch64-unknown-linux-gnu{-m32,} Ok for trunk? gcc/ChangeLog: PR middle-end/102080 * match.pd: Check mask type when doing cond_op related gimple simplification. * tree.c (is_truth_type_for): New funct

[PATCH] [AVX512FP16] Support embedded broadcast for AVX512FP16 instructions.

2021-09-16 Thread liuhongt via Gcc-patches
Bootstrapped and regtest on x86_64-pc-linux-gnu{-m32,}. Runtime tests passed under sde{-m32,}. gcc/ChangeLog: PR target/87767 * config/i386/i386.c (ix86_print_operand): Handle V8HF/V16HF/V32HFmode. * config/i386/i386.h (VALID_BCST_MODE_P): Add HFmode. *

[PATCH] [i386] Fix ICE in pass_rpad.

2021-09-17 Thread liuhongt via Gcc-patches
Besides conversion instructions, pass_rpad also handles scalar sqrt/rsqrt/rcp/round instructions, while r12-3614 should only want to handle conversion instructions, so fix it. Bootstrapped and regtest on x86_64-linux-gnu{-m32,} w/ configure --enable-checking=yes,rtl,extra, failed tests are fixed

[PATCH] Support 64bit fma/fms/fnma/fnms under avx512vl.

2021-09-21 Thread liuhongt via Gcc-patches
Hi: fma/fms/fnma/fnmsv2sf4 are defined only under (TARGET_FMA || TARGET_FMA4). The patch extends the expanders to TARGET_AVX512VL. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/mmx.md (fmav2sf4): Extend to AVX512 fma. (f

[PATCH] [i386] Adjust testcase.

2021-09-21 Thread liuhongt via Gcc-patches
Pushed to trunk. gcc/testsuite/ChangeLog: * gcc.target/i386/pr92658-avx512f.c: Refine testcase. * gcc.target/i386/pr92658-avx512vl.c: Adjust scan-assembler, only v2di->v2qi truncate is not supported, v4di->v4qi should be supported. --- gcc/testsuite/gcc.target/i38

[PATCH] wwwdocs: [GCC12] Mention Intel AVX512-FP16.

2021-09-22 Thread liuhongt via Gcc-patches
--- htdocs/gcc-12/changes.html | 8 ++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html index 81f62fe3..14149212 100644 --- a/htdocs/gcc-12/changes.html +++ b/htdocs/gcc-12/changes.html @@ -165,8 +165,12 @@ a work-in-progre

[PATCH 0/7] AVX512FP16: Support bunch of expanders for HFmode and vector HFmodes

2021-09-22 Thread liuhongt via Gcc-patches
expander for smin/maxhf3. AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes AVX512FP16: Add float(uns)?mn2 expander AVX512FP16: add truncmn2/extendmn2 expanders AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes. liuhongt (2): AVX512FP16: Add expander for rint

[PATCH 1/7] AVX512FP16: Add expander for rint/nearbyinthf2.

2021-09-22 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/i386.md (rinthf2): New expander. (nearbyinthf2): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-builtin-round-1.c: Add new testcase. --- gcc/config/i386/i386.md | 22 +++ .../i386/avx

[PATCH 2/7] AVX512FP16: Add expander for fmahf4

2021-09-22 Thread liuhongt via Gcc-patches
gcc/ChangeLog: * config/i386/sse.md (FMAMODEM): extend to handle FP16. (VFH_SF_AVX512VL): Extend to handle HFmode. (VF_SF_AVX512VL): Deleted. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-fma-1.c: New test. * gcc.target/i386/avx512fp16vl-fma-1.c: N

[PATCH 3/7] AVX512FP16: Add expander for smin/maxhf3.

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang gcc/ChangeLog: * config/i386/i386.md (hf3): New expander. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512fp16-builtin-minmax-1.c: New test. --- gcc/config/i386/i386.md | 11 ++ .../i386/avx512fp16-builtin-minmax-1.c| 35 +++

[PATCH 4/7] AVX512FP16: Add fix(uns)?_truncmn2 for HF scalar and vector modes

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang NB: 64bit/32bit vectorize for HFmode is not supported for now, will adjust this patch when V2HF/V4HF operations supported. gcc/ChangeLog: * config/i386/i386.md (fix_trunchf2): New expander. (fixuns_trunchfhi2): Likewise. (*fixuns_trunchfsi2zext): New de

[PATCH 5/7] AVX512FP16: Add float(uns)?mn2 expander

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang gcc/ChangeLog: * config/i386/sse.md (float2): New expander. (avx512fp16_vcvt2ph_): Rename to ... (floatv4hf2): ... this, and drop constraints. (avx512fp16_vcvtqq2ph_v2di): Rename to ... (floatv2div2hf2): ... this, and like

[PATCH 6/7] AVX512FP16: add truncmn2/extendmn2 expanders

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang gcc/ChangeLog: * config/i386/sse.md (extend2): New expander. (extendv4hf2): Likewise. (extendv2hfv2df2): Likewise. (trunc2): Likewise. (avx512fp16_vcvt2ph_): Rename to ... (truncv4hf2): ... this, and drop constraints.

[PATCH 7/7] AVX512FP16: Enable vec_cmpmn/vcondmn expanders for HF modes.

2021-09-22 Thread liuhongt via Gcc-patches
From: Hongyu Wang gcc/ChangeLog: * config/i386/i386-expand.c (ix86_use_mask_cmp_p): Enable HFmode mask_cmp. * config/i386/sse.md (sseintvecmodelower): Add HF vector modes. (_store_mask): Extend to support HF vector modes. (vec_cmp): Likewise. (vcon

[PATCH] [GCC12] Mention Intel AVX512-FP16 and _Float16 support.

2021-09-23 Thread liuhongt via Gcc-patches
Updated, mention _Float16 support. --- htdocs/gcc-12/changes.html | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/htdocs/gcc-12/changes.html b/htdocs/gcc-12/changes.html index 81f62fe3..f19c6718 100644 --- a/htdocs/gcc-12/changes.html +++ b/htdocs/gcc-12/changes.

[PATCH] [GIMPLE] Simplify (_Float16) ceil ((double) x) to .CEIL (x) when available.

2021-09-24 Thread liuhongt via Gcc-patches
Hi: Related discussion in [1] and PR. Bootstrapped and regtest on x86_64-linux-gnu{-m32,}. Ok for trunk? [1] https://gcc.gnu.org/pipermail/gcc-patches/2021-July/574330.html gcc/ChangeLog: PR target/102464 * config/i386/i386.c (ix86_optab_supported_p): Return true f

[PATCH] [i386] Remove storage only description for _Float16 w/o avx512fp16.

2021-09-24 Thread liuhongt via Gcc-patches
[1] https://gcc.gnu.org/pipermail/gcc-patches/2021-September/580207.html gcc/ChangeLog: * doc/extend.texi (Half-Precision): Remove storage only description for _Float16 w/o avx512fp16. --- gcc/doc/extend.texi | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff

[PATCH] Enable auto-vectorization at O2 with very-cheap cost model.

2021-09-25 Thread liuhongt via Gcc-patches
Hi: > Please don't add the -fno- option to the warning tests.  As I said, > I would prefer to either suppress the vectorization for the failing > cases by tweaking the test code or xfail them.  That way future > regressions won't be masked by the option.  Once we've moved > the warning to a more su

[PATCH] Revert "Optimize v4sf reduction.".

2021-09-27 Thread liuhongt via Gcc-patches
Revert due to performance regression. This reverts commit 8f323c712ea76cc4506b03895e9b991e4e4b2baf. PR target/102473 PR target/101059 --- gcc/config/i386/sse.md| 39 ++- gcc/testsuite/gcc.target/i386/sse2-pr101059.c | 32 --- gcc/tests
