from:"liuhongt"

[PATCH v3] Remove SPR/GNR/DMR from avx512_{move, store}_by pieces tune.

2025-09-17 Thread liuhongt

_store_by_pieces. Since they eventually have the same impact as just setting ix86_move_max. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} Ready push to trunk. Align move_max with prefer_vector_width for SPR/GNR/DMR similar as below commit. commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 Autho

[PATCH v2] Remove SPR/GNR/DMR from avx512_move_by_pieces tune.

2025-09-16 Thread liuhongt

From: "hongtao.liu" Update in V2: Only remove SPR/GNR/DMR from avx512_move_by_pieces. Align move_max with prefer_vector_width for SPR/GNR/DMR similar as below commit. commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 Author: liuhongt Date: Thu Aug 15 12:54:07 2024 +0800 A

[PATCH] Remove SPR/GNR/DMR from avx512_{move,store}_by pieces tune.

2025-09-15 Thread liuhongt

From: "hongtao.liu" Align move_max with prefer_vector_width for SPR/GNR/DMR to avoid STLF issue. It's similar as previous commit. commit 6ea25c041964bf63014fcf7bb68fb1f5a0a4e123 Author: liuhongt Date: Thu Aug 15 12:54:07 2024 +0800 Align ix86_{move_max,store_max}

[PATCH] [x86] Optimize vpermpd to vbroadcastf128 for specific permutations.

2025-09-14 Thread liuhongt

Broadcast from memory is better than load 128-bit vector + permutation to 256-bit vector. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/predicates.md (avx_vbroadcast128_operand): New predicate. * config/i386/ss

[PATCH v3] [x86] Exclude fake cross-lane permutation from avx256_avoid_vec_perm.

2025-09-11 Thread liuhongt

SLP may take a broadcast as a kind of vec_perm; the patch checks the permutation index to exclude those false positives. > > > so the vectorizer costs sth with count == 0? I'll see to fix that, > > > but this also > > > means the code should have used m_num_avx256_vec_perm[where] += count. Changed.

[PATCH v2 1/2] [x86] Exclude fake cross-lane permutation from avx256_avoid_vec_perm.

2025-09-05 Thread liuhongt

SLP may take a broadcast as a kind of vec_perm; the patch checks the permutation index to exclude those false positives. > Btw, you can now (in some cases) do better, namely you should > always have 'node' available and when SLP_TREE_PERMUTE_P (node) > then SLP_TREE_LANE_PERMUTATION could be inspecte

[PATCH v2 2/2] [x86] Use vpermil{ps, pd} instead of vperm{d, q} when permutation is in-lane.

2025-09-04 Thread liuhongt

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/i386-expand.cc (expand_vec_perm_vpermil): Extend to handle V8SImode. (avx_vpermilp_parallel): Extend to handle vector integer modes with same vector size and

[PATCH] [x86] Fix ICE due to wrong operand is passed to ix86_vgf2p8affine_shift_matrix.

2025-08-30 Thread liuhongt

1) Fix predicate of operands[3] in cond_ since only const_vec_dup_operand is expected for masked operations, and pass real count to ix86_vgf2p8affine_shift_matrix. 2) Pass operands[2] instead of operands[1] to gen_vgf2p8affineqb__mask which expects the operand to be shifted, but operands[1] is mask

[PATCH] Document -param=ix86-vect-unroll-limit.

2025-08-28 Thread liuhongt

Pushed as obvious. gcc/ChangeLog: * doc/invoke.texi: Document -param=ix86-vect-unroll-limit. --- gcc/doc/invoke.texi | 3 +++ 1 file changed, 3 insertions(+) diff --git a/gcc/doc/invoke.texi b/gcc/doc/invoke.texi index 56c4fa86e34..4e063e43c85 100644 --- a/gcc/doc/invoke.texi +++ b/gcc/

[PATCH] Fix _Decimal128 arithmetic error under FE_UPWARD.

2025-08-27 Thread liuhongt

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. libgcc/config/libbid/ChangeLog: PR target/120691 * bid128_div.c: Fix _Decimal128 arithmetic error under FE_UPWARD. * bid128_rem.c: Ditto. * bid128_sqrt.c: Ditto. * bid64_

[PATCH] Restrict avx256_avoid_vec_perm only for loop vectorization.

2025-08-26 Thread liuhongt

Since kind == vec_perm may not be a real vec_perm, just a broadcast or simple load in BB vectorizer. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/i386.cc (ix86_vector_costs::finish_cost): Restrict tune avx256_avoid_ve

[PATCH v2] [x86] Enable unroll in the vectorizer when there's reduction for FMA/DOT_PROD_EXPR/SAD_EXPR

2025-08-10 Thread liuhongt

> > The comment doesn't match the bool type. > Fixed. > > is_gimple_assign (stmt_info->stmt) > Changed. > There's also SAD_EXPR? The vectorizer has lane_reducing_op_p () > for this that also lists WIDEN_SUM_EXPR. Add SAD_EXPR since x86 supports usad{v16qi, v32qi, v64qi}. Not add WIDEN_SUM_EXPR s

[PATCH] [x86] Enable unroll in the vectorizer when there's reduction for FMA/DOT_PROD_EXPR

2025-07-29 Thread liuhongt

The patch tries to unroll the vectorized loop when there are FMA/DOT_PROD_EXPR reductions; it will break cross-iteration dependence and enable more parallelism (since vectorization will also enable partial sums). When there's gather/scatter or scalarization in the loop, don't do the unroll since the

[PATCH] Remove V64SFmode and V64SImode.

2025-07-29 Thread liuhongt

It's needed by avx5124vnniw/avx5124fmaps which have been removed by r15-656-ge1a7e2c54d52d0. Ready push to trunk after passing regression test. gcc/ChangeLog: * config/i386/i386-modes.def: Remove VECTOR_MODES(FLOAT, 256) and VECTOR_MODE (INT, SI, 64). * config/i386/i386.c

[PATCH] Eliminate redundant vpextrq/vpinsrq when move TI to V4SI.

2025-07-29 Thread liuhongt

r14-1902-g96c3539f2a3813 splits a TImode move into 2 DImode moves; it's supposed to optimize TImode in parameter/return since, according to the psABI, it's stored in 2 general registers. But when TImode is not in parameter/return, it could create the redundancy in the PR. The patch adds a splitter to handle th

[PATCH] Don't duplicate setup code cost when do group-candidate cost calculation.

2025-06-23 Thread liuhongt

From: "hongtao.liu" - /* Uses in a group can share setup code, so only add setup cost once. */ - cost -= cost.scratch; It looks like the original code took into account avoiding double counting, but unfortunately cost is reset inside the follow loop which invalidates the upper code, and makes

[PATCH] [x86] [PR103750] Also handle avx512 kmask & immediate 15 or 3 when VF is 4/2.

2025-06-04 Thread liuhongt

Like r16-105-g599bca27dc37b3, the patch handles redundant cleanup of upper bits for maskload, i.e. Successfully matched this instruction: (set (reg:V4DF 175) (vec_merge:V4DF (unspec:V4DF [ (mem:V4DF (plus:DI (reg/v/f:DI 155 [ b ]) (reg:DI 143 [ ivtmp.56

[PATCH V2] For datarefs with big gap, split them into different groups.

2025-05-26 Thread liuhongt

> > It's https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119181 > > Please mention that in the changelog. Also ... Changed. > Please put this condition in the set of conds we test in the else branch of > ... > > > > /* Do not place the same access in the interleaving chain > > > twice.

[PATCH] [AUTOFDO] Don't scale bb_count with ipa_count when ipa_count is zero but count_max is not

2025-05-18 Thread liuhongt

From: "hongtao.liu" AutoFDO profile is a scaled profile; as a result, 0 samples does not mean never executed, especially when there's profile from the function body. Prevent combine_with_ipa_count (ipa_count) from zeroing all bb->count. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} OK for trunk

[PATCH v3] Extend vect_recog_cond_expr_convert_pattern to handle REAL_CST

2025-05-18 Thread liuhongt

Changed, here's the updated patch I'm going to check in. REAL_CST is handled if it can be represented in different floating point types without loss of precision or under fast math. gcc/ChangeLog: PR tree-optimization/103771 * match.pd (cond_expr_convert_p): Extend the match to h

[PATCH] For datarefs with big gap, split them into different groups.

2025-05-15 Thread liuhongt

The patch tries to solve miss vectorization for below case. void foo (int* a, int* restrict b) { b[0] = a[0] * a[64]; b[1] = a[65] * a[1]; b[2] = a[2] * a[66]; b[3] = a[67] * a[3]; b[4] = a[68] * a[4]; b[5] = a[69] * a[5]; b[6] = a[6] * a[70]; b[7] = a[7] * a[71]; }

[PATCH] Add pattern match in match.pd for .AVG_CEIL

2025-05-15 Thread liuhongt

1) Optimize (a >> 1) + (b >> 1) + ((a | b) & 1) to .AVG_CEIL (a, b) 2) Optimize (a | b) - ((a ^ b) >> 1) to .AVG_CEIL (a, b) Proof is at https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118994#c6 Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: PR middle
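The two identities in this snippet can be checked exhaustively on a small type. The following is a minimal sketch, not code from the patch: the function names are made up for illustration, and 8-bit unsigned operands are an assumption chosen so that every input pair can be tested.

```c
#include <assert.h>
#include <stdint.h>

/* Reference: ceil((a + b) / 2), computed in a wider type so the sum cannot overflow. */
uint8_t avg_ceil_ref(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + (unsigned)b + 1u) >> 1);
}

/* Form 1 from the snippet: (a >> 1) + (b >> 1) + ((a | b) & 1). */
uint8_t avg_ceil_form1(uint8_t a, uint8_t b)
{
    return (uint8_t)((a >> 1) + (b >> 1) + ((a | b) & 1));
}

/* Form 2 from the snippet: (a | b) - ((a ^ b) >> 1). */
uint8_t avg_ceil_form2(uint8_t a, uint8_t b)
{
    return (uint8_t)((a | b) - ((a ^ b) >> 1));
}

/* Returns 1 if both forms match the reference for every pair of 8-bit inputs. */
int avg_ceil_forms_agree(void)
{
    for (unsigned a = 0; a < 256; a++)
        for (unsigned b = 0; b < 256; b++) {
            uint8_t r = avg_ceil_ref((uint8_t)a, (uint8_t)b);
            if (avg_ceil_form1((uint8_t)a, (uint8_t)b) != r
                || avg_ceil_form2((uint8_t)a, (uint8_t)b) != r)
                return 0;
        }
    return 1;
}

void demo_avg_ceil(void)
{
    assert(avg_ceil_forms_agree());
}
```

Neither form needs a wider intermediate type, which is presumably why such source patterns are worth recognizing as a single averaging operation.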

[PATCH v3] Extend vect_recog_cond_expr_convert_pattern to handle REAL_CST

2025-05-13 Thread liuhongt

So it won't do the unsafe truncation for double(1.001) to float(1.0) since there's precision loss. It's guarded by testcase pr103771-6.c Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? REAL_CST is handled if it can be represented in different floating point typ
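The "without loss of precision" condition can be illustrated with a round-trip check. This is a minimal sketch in plain C, not the compiler's implementation (which works on REAL_CST trees); `representable_as_float` is a hypothetical helper name.

```c
#include <assert.h>

/* Returns 1 when the double constant survives a round trip through float.
   The volatile store forces the narrowing to actually happen, even on
   targets that keep excess precision in registers. */
int representable_as_float(double c)
{
    volatile float narrowed = (float)c;
    return (double)narrowed == c;
}

void demo_representable(void)
{
    assert(representable_as_float(1.5));    /* exact in both types */
    assert(!representable_as_float(1.001)); /* narrowing would lose precision */
}
```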

[PATCH v3] Consider frequency in cost estimation when converting scalar to vector.

2025-05-13 Thread liuhongt

Update in V3 > > > + cost_sse_integer = 0; > > > + weighted_cost_sse_integer = 0 ; > Extra space here. Changed. > > > + : ix86_size_cost.sse_to_integer; > > Please be sure to not revert the changes from my patch adding > COSTS_N_INSNS (...) / 2 > here and some other places. Yes, keep the

[PATCH] Update libbid according to the latest Intel Decimal Floating-Point Math Library.

2025-05-13 Thread liuhongt

The Intel Decimal Floating-Point Math Library is available as open-source on Netlib[1]. [1] https://www.netlib.org/misc/intel/ Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. libgcc/config/libbid/ChangeLog: * bid128_string.c (MIN_DIGITS): New macro.

[PATCH v2 1/2] Extend vect_recog_cond_expr_convert_pattern to handle floating point type.

2025-05-12 Thread liuhongt

Updated in V2 > > Can you instead of mangling in float support use separate (match like > for the below cases? I tried, but it reported a duplicated definition since they share the same pattern like (cond (simple_comparison@6 @0 @1) (convert@4 @2) (convert@5 @3)) No idea how to split that. > > > @@ -1130

[PATCH v2 2/2] Extend vect_recog_cond_expr_convert_pattern to handle REAL_CST

2025-05-12 Thread liuhongt

REAL_CST is handled if it can be represented in different floating point types without loss of precision or under fast math. gcc/ChangeLog: PR tree-optimization/103771 * match.pd (cond_expr_convert_p): Extend the match to handle REAL_CST. * tree-vect-patterns.cc

[PATCH v3] Consider frequency in cost estimation when converting scalar to vector.

2025-05-08 Thread liuhongt

The only part I changed is related to size_cost of sse_to_integer, as below: + /* Under TARGET_SSE4_1, it's vmovd + vpextrd/vpinsrd. + W/o it, it's movd + psrlq/unpckldq + movd. */ + else if (!TARGET_64BIT && smode != SImode) + cost *= TARGET_SSE4_1 ? 2 : 3; + Ok for trun

[V2 PATCH] Fix name mismatch for fortran.

2025-05-07 Thread liuhongt

From: "hongtao.liu" > The check you added seems correct to me. Do we need to keep the > afdo_string_table->get_index (IDENTIFIER_POINTER ( > DECL_ASSEMBLER_NAME (edge->callee->decl))) != s->name () > check? Should your check replace it rather than be an additional check? I verified t

[PATCH V3] [autofdo] Annotate empty bb with all debug_stmt with location of phi in the single_succ.

2025-04-29 Thread liuhongt

From: "hongtao.liu" > another thing, you can save the walk over PHI args by using > > gimple_phi_arg_location (phi, tmp_e->dest_idx); > Changed, use gimple_phi_arg_location_from_edge (phi, tmp_e); For an empty BB with all debug_stmt, it will be ignored by afdo_set_bb

[PATCH v2] Consider frequency in cost estimation when converting scalar to vector.

2025-04-28 Thread liuhongt

> I am generally trying to get rid of remaing uses of REG_FREQ since the > 1 based fixed point arithmetics iot always working that well. > > You can do the sums in profile_count type (doing something reasonable > when count is uninitialized) and then convert it to sreal for the final > heuristi

[PATCH] Remove other processors from X86_TUNE_DEST_FALSE_DEP_FOR_GLC except GLC

2025-04-28 Thread liuhongt

Since the tune is only for GLC (sapphirerapids and alderlake-P). Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk and backport to GCC15/GCC14/GCC13 release branches. gcc/ChangeLog: * config/i386/x86-tune.def (X86_TUNE_DEST_FALSE_DEP_FOR_GLC): Remove ot

[PATCH] Extend vect_recog_cond_expr_convert_pattern to handle floating point type.

2025-04-28 Thread liuhongt

For floating point, !flag_trapping_math is needed for the pattern which transforms 2 conversions to 1 conversion, and may lose 1 potential trap. There shouldn't be any accuracy issue. It also handles real_cst if it can be represented in different floating point types without loss of precision. Bo

[PATCH v2] [autofdo] Annotate empty bb with all debug_stmt with location of phi in the single_succ.

2025-04-28 Thread liuhongt

From: "hongtao.liu" > I think the comment is a bit off, it should be "For an empty BB ..." since > we should not change behavior on whether there are debug stmts or not. Changed. For an empty BB with all debug_stmt, it will be ignored by afdo_set_bb_count, but it can be set with count of single

[PATCH] [autofdo] Annotate bb with all debug_stmt with location of phi in the single_succ.

2025-04-27 Thread liuhongt

From: "hongtao.liu" For BB with all debug_stmt, it will be ignored by afdo_set_bb_count, but it can be set with count of single successors PHIs which edge from the BB (only nonzero count is annotated). Tested with -march=x86-64-v3 -O2 autofdo enabled, the issue in the PR is fixed. Bootstrapped

[PATCH] Fix name mismatch for fortran.

2025-04-27 Thread liuhongt

From: "hongtao.liu" Function name in afdo_string_table is step3d_t_tile. but DECL_ASSEMBLER_NAME (edge->callee->decl))) gets __step3d_t_mod_MOD_step3d_t_tile, Looks like the prefix is not in the debug string table, so let's also check directly for afdo_string_table->get_index_by_decl (edge->calle

[PATCH] Refactor msse4 and mno-sse4.

2025-04-24 Thread liuhongt

This is originally from [1] For the command line, or target attribute, the actual operation goes into ix86_handle_option, and as long as we get it right in this ix86_handle_option, everything else should be fine. As for the macros generated by the mask name (TARGET_SSE4_1_P), their mea

[PATCH] target: [PR103750] Also handle avx512 kmask & immediate 15 or 3 when VF is 4/2.

2025-04-22 Thread liuhongt

cat test.c void foo () { __mmask8 mask1 = _mm_cmpeq_epu32_mask (pi128[0], pi128[1]); a = mask1 & 15; } with -O2 -march=x86-64-v4, gcc generates foo(): movqpi128(%rip), %rax vmovdqa (%rax), %xmm0 vpcmpeqd16(%rax), %xmm0, %k0 kmovb %k0, %eax
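Why the `& 15` is redundant can be seen in a scalar model of the compare. This is only an illustrative sketch: it assumes, as the patch title implies, that a compare over 4 elements can only set the low 4 mask bits, and `cmpeq4_mask_model` is a made-up name rather than an intrinsic.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of a 4-lane equality compare that produces a bit mask:
   bit i is set when lane i matches; bits 4 and above are never set. */
unsigned cmpeq4_mask_model(const uint32_t a[4], const uint32_t b[4])
{
    unsigned mask = 0;
    for (int i = 0; i < 4; i++)
        if (a[i] == b[i])
            mask |= 1u << i;
    return mask;
}

void demo_mask_and_15(void)
{
    uint32_t a[4] = { 1, 2, 3, 4 };
    uint32_t b[4] = { 1, 0, 3, 0 };
    unsigned m = cmpeq4_mask_model(a, b);
    assert(m == 5u);          /* lanes 0 and 2 match */
    assert((m & 15u) == m);   /* masking with 15 changes nothing */
}
```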

[PATCH] Accept allones or 0 operand for vcond_mask op1.

2025-04-20 Thread liuhongt

Since ix86_expand_sse_movcc will simplify them into a simple vmov, vpand or vpandn. Current register_operand/vector_operand could lose some optimization opportunity. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog: * config/i386/predicates.md (vector

[PATCH] [x86] Generate 2 FMA instructions in ix86_expand_swdivsf.

2025-04-20 Thread liuhongt

From: "hongtao.liu" When FMA is available, N-R step can be rewritten with a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a which have 2 fma generated.[1] [1] https://bugs.llvm.org/show_bug.cgi?id=21385 Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? gcc/ChangeLog
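The rewritten Newton-Raphson step can be tried out in scalar C. This is a rough sketch under stated assumptions: `rcp_estimate` is a hypothetical stand-in for a hardware reciprocal estimate (here 1/b with the low 12 mantissa bits cleared), and the two multiply-add expressions are written as plain C that a compiler may or may not fuse into FMA instructions.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical low-precision reciprocal: 1/b with the low 12 mantissa
   bits cleared, giving a relative error below 2^-11. */
float rcp_estimate(float b)
{
    float x = 1.0f / b;
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);
    bits &= ~UINT32_C(0xFFF);
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* a / b = (a - (rcp(b) * a * b)) * rcp(b) + rcp(b) * a */
float swdiv_sketch(float a, float b)
{
    float x0 = rcp_estimate(b);
    float q0 = x0 * a;          /* rcp(b) * a */
    float r = (-q0) * b + a;    /* first multiply-add: a - rcp(b) * a * b */
    return r * x0 + q0;         /* second multiply-add: r * rcp(b) + rcp(b) * a */
}

void demo_swdiv(void)
{
    float err = swdiv_sketch(1.0f, 3.0f) - 1.0f / 3.0f;
    assert(err > -1e-5f && err < 1e-5f);
}
```

The residual `r` measures how far the first quotient estimate is from `a`, so multiplying it by the reciprocal and adding it back corrects most of the estimate's error in one step.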

[PATCH] Consider frequency in cost estimation when converting scalar to vector.

2025-04-17 Thread liuhongt

In some benchmarks, I noticed STV failed because the cost was judged unprofitable: the igain is inside the loop but the sse<->integer conversion is outside the loop, and the current cost model doesn't consider the frequency of those gains/costs. The patch weights those costs with frequency just like LRA does. Bootstrapped a
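The effect of weighting by frequency can be shown with a toy calculation. This is a sketch with made-up numbers and a hypothetical `weighted_item` record; it is not the cost model in the patch.

```c
#include <assert.h>

/* Hypothetical record: a gain (positive) or cost (negative) and how often
   the containing block executes relative to function entry. */
struct weighted_item {
    long delta;
    long freq;
};

/* Sum of gains and costs, ignoring execution frequency. */
long unweighted_total(const struct weighted_item *items, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += items[i].delta;
    return total;
}

/* Sum of gains and costs, each scaled by its block's frequency. */
long weighted_total(const struct weighted_item *items, int n)
{
    long total = 0;
    for (int i = 0; i < n; i++)
        total += items[i].delta * items[i].freq;
    return total;
}

void demo_weighted_cost(void)
{
    /* A gain of 1 inside a loop that runs 100 times, and a conversion
       cost of 10 paid once outside the loop. */
    struct weighted_item items[] = { { 1, 100 }, { -10, 1 } };
    assert(unweighted_total(items, 2) == -9);  /* looks unprofitable */
    assert(weighted_total(items, 2) == 90);    /* profitable once weighted */
}
```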

[PATCH] Revert documents from r11-344-g0fec3f62b9bfc0

2025-04-13 Thread liuhongt

Look like those operand modifiers are only for internal usage in .md files, so for simplicity, I'll just remove them from extend.texi. Ready push to trunk. gcc/ChangeLog: PR documentation/108134 * doc/extend.texi: Remove documents from r11-344-g0fec3f62b9bfc0. --- gcc/doc/extend

[PATCH] Use ix86_fp_comparison_operator in cbranchbf4 to avoid ICE.

2025-03-18 Thread liuhongt

*jcc only supports ix86_fp_comparison_operator for CCFP, when comparison code is LT, there's an ICE. W/o AVX10.2, it's ok since do_compare_rtx_and_jump will transform LT to GT, but w/ AVX10.2 it goes directly into ix86_expand_branch which doesn't handle it. Use ix86_fp_comparison_operator in cbran

[PATCH] [testsuite] Mark gcc.target/i386/apx-ndd-tls-1b.c as xfail.

2025-03-16 Thread liuhongt

It looks like the testcase is fragile, it's supposed to check the compiler ability of generating code_6_gottpoff_reloc instruction, but failed since there's a seg_prefixed memory usage(r14-6242-gd564198f960a2f). mov r13, QWORD PTR j@gottpoff[rip] mov r12, QWORD PTR a@gottpo

[PATCH 3/3] Adjust testcases after better RA decision.

2025-02-09 Thread liuhongt

After optimization for RA, memory op is not propagated into instructions(>1), and it make testcases not generate vxorps since the memory is loaded into the dest, and the dest is never unused now. So rewrite testcases to make the codegen more stable. gcc/testsuite/ChangeLog: * gcc.target/

[PATCH 1/3] Use NO_REGS in cost calculation when the preferred register class are not known yet.

2025-02-09 Thread liuhongt

gcc/ChangeLog: PR rtl-optimization/108707 * ira-costs.cc (scan_one_insn): Use NO_REGS instead of GENERAL_REGS when preferred reg_class is not known. gcc/testsuite/ChangeLog: * gcc.target/i386/pr108707.c: New test. (cherry picked from commit 0368d169492017cfab5622

[PATCH 2/3] Only use NO_REGS in cost calculation when !hard_regno_mode_ok for GENERAL_REGS and mode.

2025-02-09 Thread liuhongt

r14-172-g0368d169492017 replaces GENERAL_REGS with NO_REGS in cost calculation when the preferred register class are not known yet. It regressed powerpc PR109610 and PR109858, it looks too aggressive to use NO_REGS when mode can be allocated with GENERAL_REGS. The patch takes a step back, still use

[PATCH 0/3] GCC13/GCC12 backport [PR108707][PR109610]

2025-02-09 Thread liuhongt

2 and r14-1252 to GCC13 and GCC12 release branch. Note r14-1252 is a fix to r14-172 which regressed powerpc testcase in PR109610. I have verified the fix also works on GCC13/GCC12 branch for PR109610. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}, and aarch64-linux-gnu. Ok for backport

[PATCH] [x86][avx512] Fix typo to avoid ICE.

2025-01-15 Thread liuhongt

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: PR target/118489 * config/i386/sse.md (VF1_AVX512BW): Fix typo. gcc/testsuite/ChangeLog: * gcc.target/i386/pr118489.c: New test. --- gcc/config/i386/sse.md |

[PATCH] Refactor ix86_expand_vecop_qihi2.

2025-01-09 Thread liuhongt

Since there's regression to use vpermq, and it's manually disabled by !TARGET_AVX512BW. I remove the codes related to vpermq and make ix86_expand_vecop_qihi2 only handle vpmovbw + op + vpmovwb case. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog:

[PATCH V2] Fix inaccuracy in cunroll/cunrolli when considering what's innermost loop.

2024-12-09 Thread liuhongt

> Please pass 'sbitmap' instead of auto_sbitmap&, it should properly > decay to that. Applies everywhere I think. > Changed. > In fact I wonder whether we should simply populate the bitmap > from a > > for (auto loop : loops_list (cfun, LI_ONLY_INNERMOST)) > bitmap_set_bit (original_innerm

[PATCH] Fix inaccuracy in cunroll/cunrolli when considering what's innermost loop.

2024-12-05 Thread liuhongt

r15-919-gef27b91b62c3aa removed 1 / 3 size reduction for innermost loop, but it doesn't accurately remember what's "innermost" for 2 testcases in PR117888. 1) For pass_cunroll, the "innermost" loop could be an originally outer loop with inner loop completely unrolled by cunrolli. The patch moves l

[PATCH] [x86] [RFC] Prevent loop vectorization if it's in a deeply nested big loop.

2024-11-26 Thread liuhongt

When a loop requires any kind of versioning which could increase register pressure too much, and it's in a deeply nested big loop, don't do vectorization. I tested the patch with both Ofast and O2 for SPEC2017; besides 548.exchange_r, other benchmarks are the same binary. Bootstrapped and regtested on x

[PATCH] [x86] Fix uninitialized operands[2] in vec_unpacks_hi_v4sf.

2024-11-22 Thread liuhongt

It could cause weird spills in RA when register pressure is high. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ok for trunk? BTW, it's difficult to get a decent testcase for the issue since the spill is not exposed in a simple testcase. gcc/ChangeLog: PR target/117562

[PATCH] Guard truncate from vector float to vector __bf16 with !flag_rounding_math && HONOR_NANS (BFmode).

2024-11-07 Thread liuhongt

hw instruction doesn't raise exceptions, turns sNAN into qNAN quietly, and always round to nearest (even). Output denormals are always flushed to zero and input denormals are always treated as zero. MXCSR is not consulted nor updated. W/o native instructions, flag_unsafe_math_optimizations is neede

[PATCH] Make ix86_align_loops uarch-specific tune.

2024-11-06 Thread liuhongt

Disable the tune for Zhaoxin/CLX/SKX since it could hurt performance for the inner loop. According to last test, align_loop helps performance for SPEC2017 on EMR and Znver4. So I'll still keep the tune for generic part. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Any comment? gcc/

[PATCH] Fix ICE due to subreg:us_truncate.

2024-10-29 Thread liuhongt

Force_operand issues an ICE when input is (subreg:DI (us_truncate:V8QI)), it's probably because it's an invalid rtx, So refine backend patterns for that. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: PR target/117318 * config/i386/s

[PATCH 2/2] Support vector float_extend from __bf16 to float.

2024-10-29 Thread liuhongt

It's supported by vector permutation with zero vector. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_vector_bf2sf_with_vec_perm): New function. * config/i386/i386-protos.h (ix86_expand_vector_bf2sf_with_vec_perm): New Declare. * config/i386/mmx.m
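In scalar terms, pairing each __bf16 element with a zero 16-bit element amounts to placing the bf16 bits in the upper half of a 32-bit float. This is a minimal sketch, assuming the bf16 value is held as a raw `uint16_t`; `bf16_bits_to_float` is a hypothetical helper, not a function from the patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Extend raw bf16 bits to float: the bf16 bits become the upper 16 bits
   of the float and the lower 16 bits are zero. */
float bf16_bits_to_float(uint16_t h)
{
    uint32_t bits = (uint32_t)h << 16;
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

void demo_bf16_extend(void)
{
    assert(bf16_bits_to_float(0x3F80) == 1.0f);
    assert(bf16_bits_to_float(0xC000) == -2.0f);
}
```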

[PATCH 1/2] [x86] Support vector float_truncate for SF to BF.

2024-10-29 Thread liuhongt

Generate native instruction whenever possible, otherwise use vector permutation with odd indices. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/i386-expand.cc (ix86_expand_vector_sf2bf_with_vec_perm): New function.
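Selecting the odd-indexed 16-bit elements keeps the upper half of each 32-bit float, assuming little-endian element numbering. The scalar sketch below shows that fallback as a plain truncation with no rounding; `float_to_bf16_bits_truncate` is a hypothetical name and this is not the expander's code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Truncate a float to raw bf16 bits by keeping only its upper 16 bits.
   The dropped low 16 bits are discarded, not rounded. */
uint16_t float_to_bf16_bits_truncate(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return (uint16_t)(bits >> 16);
}

void demo_bf16_truncate(void)
{
    assert(float_to_bf16_bits_truncate(1.0f) == 0x3F80);
    assert(float_to_bf16_bits_truncate(-2.0f) == 0xC000);
}
```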

[PATCH] [x86] Fix ICE due to isa mismatch for the builtins.

2024-10-22 Thread liuhongt

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk and backport to release branch. gcc/ChangeLog: PR target/117240 * config/i386/i386-builtin.def: Add avx/avx512f to vaes ymm/zmm builtins. gcc/testsuite/ChangeLog: * gcc.target/i386/pr11

[PATCH] i386: Optimize EQ/NE comparison between avx512 kmask and -1.

2024-10-21 Thread liuhongt

r15-974-gbf7745f887c765e06f2e75508f263debb60aeb2e has optimized for jcc/setcc, but missed movcc. The patch supports movcc. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: PR target/117232 * config/i386/sse.md (*kortest_cmp_movqicc):

[PATCH] [GCC13/GCC12] Fix testcase.

2024-10-21 Thread liuhongt

The optimization relies on other patterns which are only available at GCC14 and above, so restore the xfail for GCC13/12 branch. Pushed as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/avx512bw-pr103750-2.c: Add xfail for ia32. --- gcc/testsuite/gcc.target/i386/avx512bw-pr1

[PATCH] [AVX512] Refine splitters related to "combine vpcmpuw + zero_extend to vpcmpuw"

2024-10-16 Thread liuhongt

r12-6103-g1a7ce8570997eb combines vpcmpuw + zero_extend to vpcmpuw with the pre_reload splitter, but the splitter transforms the zero_extend into a subreg which make reload think the upper part is garbage, it's not correct. The patch adjusts the zero_extend define_insn_and_split to define_insn to

[PATCH] Adjust testcase to avoid scan FIX in REG_EQUIV.

2024-10-15 Thread liuhongt

Also add hard_float target to avoid failed on arm-eabi, cortex-m0. Verified on cross-compiler for powerpc64le-linux-gnu, sparc-sun-solaris2.11 Ready push to trunk. gcc/testsuite/ChangeLog: PR testsuite/115365 * gcc.dg/pr100927.c: Adjust testcase to avoid scan FIX in REG_EQUIV. -

[PATCH][wwwdoc] Mention O2 vectorization enhancement.

2024-10-14 Thread liuhongt

--- htdocs/gcc-15/changes.html | 10 ++ 1 file changed, 10 insertions(+) diff --git a/htdocs/gcc-15/changes.html b/htdocs/gcc-15/changes.html index 6dc46a52..8a238256 100644 --- a/htdocs/gcc-15/changes.html +++ b/htdocs/gcc-15/changes.html @@ -36,6 +36,16 @@ a work-in-progress. General

[PATCH 2/2] [x86] Canonicalize (vec_merge (fma: op2 op1 op3) (match_dup 1)) mask) to (vec_merge (fma: op1 op2 op3) (match_dup 1)) mask)

2024-10-14 Thread liuhongt

For masked FMA, there're 2 forms of RTL representation 1) (vec_merge (fma: op2 op1 op3) op1) mask) 2) (vec_merge (fma: op1 op2 op3) op1) mask) It's because op1 op2 are commutative in RTL (the second op1 is written as (match_dup 1)) we once tried to replace (match_dup 1) with (match_operand:VFH_AV

[PATCH 1/2] [Middle-end] Canonicalize (vec_merge (fma op2 op1 op3) op1 mask) to (vec_merge (fma op1 op2 op3) op1 mask).

2024-10-14 Thread liuhongt

For x86 masked fma, there're 2 rtl representations 1) (vec_merge (fma op2 op1 op3) op1 mask) 2) (vec_merge (fma op1 op2 op3) op1 mask). 5894(define_insn "_fmadd__mask" 5895 [(set (match_operand:VFH_AVX512VL 0 "register_operand" "=v,v") 5896(vec_merge:VFH_AVX512VL 5897 (fma:VF

[PATCH 0/2] Canonicalize (vec_merge (fma op2 op1 op3) op1 mask) to (vec_merge (fma op1 op2 op3) op1 mask)

2024-10-14 Thread liuhongt

diate_operand" "0")) to enable more flexibility for pattern match and recog, but it triggered an ICE in reload (reload can handle at most one operand with "0" constraint). So we need to either add 2 patterns in the backend or just do the canonicalization in the middle-end. The

[PATCH v3 2/2] Adjust testcase after relax O2 vectorization.

2024-10-08 Thread liuhongt

Update in V3. >The testcase looks bogus: > > b[i+k] = b[i+k-5] + 2; > >accesses b[-3], can you instead adjust the inner loop to start with k == 4? Changed, also adjust b[100] to b[200] to avoid array out of bound. >Please remove this testcase - even with fully masking we'd need alias >versi

[PATCH v3 1/2] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-10-08 Thread liuhongt

>We'd also need to update the documentation: >... The @samp{very-cheap} model only >allows vectorization if the vector code would entirely replace the >scalar code that is being vectorized. For example, if each iteration >of a vectorized loop would only be able to handle exactly four iterations >

[PATCH] Don't lower vpcmpu to pcmpgt since the latter is for signed comparison.

2024-10-08 Thread liuhongt

r15-1737-gb06a108f0fbffe lowers AVX512 kmask comparisons to AVX2 ones, but wrongly lowered unsigned comparisons to signed ones; for unsigned comparison, only EQ/NEQ can be lowered. The commit fixes that. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog:
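The reason only EQ/NEQ are safe can be seen on a single element: equality depends on the bit pattern alone, while ordering depends on signedness. This is a small sketch with illustrative function names; the cast to `int32_t` relies on the usual two's-complement conversion that GCC implements.

```c
#include <assert.h>
#include <stdint.h>

/* Greater-than as a signed compare would see the same bit patterns. */
int gt_as_signed(uint32_t a, uint32_t b)
{
    return (int32_t)a > (int32_t)b;
}

/* Greater-than with the intended unsigned semantics. */
int gt_as_unsigned(uint32_t a, uint32_t b)
{
    return a > b;
}

void demo_unsigned_vs_signed(void)
{
    uint32_t big = UINT32_C(0x80000000);
    assert(gt_as_unsigned(big, 1) == 1);  /* 2147483648 > 1 */
    assert(gt_as_signed(big, 1) == 0);    /* reinterpreted as INT32_MIN */
    /* Equality gives the same answer under either interpretation. */
    assert((big == 1u) == ((int32_t)big == (int32_t)1));
}
```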

[PATCH 1/2] [x86] Add new microarchitecture tune for SRF/GRR/CWF.

2024-10-08 Thread liuhongt

For Crestmont, 4-operand vex blendv instructions come from MSROM and is slower than 3-instructions sequence (op1 & mask) | (op2 & ~mask). legacy blendv instruction can still be handled by the decoder. The patch add a new tune which is enabled for all processors except for SRF/CWF. It will use vpan
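The three-operation sequence named in the snippet is a bitwise select. The sketch below shows it on one 32-bit word, assuming every mask element is all-ones or all-zeros, as a comparison result would be. That assumption is what would make the bitwise form interchangeable with a blend keyed on the mask. `blend_bits` is a made-up name.

```c
#include <assert.h>
#include <stdint.h>

/* (op1 & mask) | (op2 & ~mask): take op1 where mask bits are set,
   op2 where they are clear. */
uint32_t blend_bits(uint32_t op1, uint32_t op2, uint32_t mask)
{
    return (op1 & mask) | (op2 & ~mask);
}

void demo_blend(void)
{
    assert(blend_bits(UINT32_C(0xAAAAAAAA), UINT32_C(0x55555555),
                      UINT32_C(0xFFFF0000)) == UINT32_C(0xAAAA5555));
}
```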

[PATCH 2/2] [x86] Add a new tune avx256_avoid_vec_perm for SRF.

2024-10-08 Thread liuhongt

According to Intel SOM[1], For Crestmont, most 256-bit Intel AVX2 instructions can be decomposed into two independent 128-bit micro-operations, except for a subset of Intel AVX2 instructions, known as cross-lane operations, can only compute the result for an element by utilizing one or more source

[PATCH 0/2] Enable more SRF tuning

2024-10-08 Thread liuhongt

ped and regtested on x86_64-pc-linux-gnu{-m32,}. The patch generally improves SPEC2017 allrate geomean by 1% with -march=sierraforest -Ofast on SRF. Ready push to trunk. liuhongt (2): [x86] Add new microarchitecture tune for SRF/GRR/CWF. [x86] Add a new tune avx256_avoid_vec_perm for SRF.

[PATCH v2 2/2] Adjust testcase after relax O2 vectorization.

2024-10-08 Thread liuhongt

gcc/testsuite/ChangeLog: * gcc.dg/fstack-protector-strong.c: Adjust scan-assembler-times. * gcc.dg/graphite/scop-6.c: Add -Wno-aggressive-loop-optimizations. * gcc.dg/graphite/scop-9.c: Ditto. * gcc.dg/tree-ssa/ivopts-lt-2.c: Add -fno-tree-vectorize.

[PATCH v2 1/2] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-10-08 Thread liuhongt

>So should we adjust very-cheap to allow niter peeling as proposed or >should we switch the default at -O2 to cheap? I prefer the former. Update in V2: Adjust testcase after relax O2 vectorization. Ok for trunk? gcc/ChangeLog: * tree-vect-loop.cc (vect_analyze_loop_costing): Enable

[PATCH] [x86] Define VECTOR_STORE_FLAG_VALUE

2024-09-24 Thread liuhongt

Return constm1_rtx when GET_MODE_CLASS (MODE) == MODE_VECTOR_INT. Otherwise NULL_RTX. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready push to trunk. gcc/ChangeLog: * config/i386/i386.h (VECTOR_STORE_FLAG_VALUE): New macro. gcc/testsuite/ChangeLog: * gcc.dg/rtl/x8
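The all-ones "true" value for integer vectors is visible at the source level through the GNU C vector extension, where a vector comparison yields -1 in each element that compares true and 0 otherwise. This is a minimal sketch that needs GCC or Clang; `eq_lane_value` is an illustrative helper and says nothing about how the macro is used inside the compiler.

```c
#include <assert.h>

typedef int v4si __attribute__((vector_size(16)));

/* Compare two 4-element integer vectors for equality and return the
   result element at index `lane`: -1 for true, 0 for false. */
int eq_lane_value(const int a[4], const int b[4], int lane)
{
    v4si va = { a[0], a[1], a[2], a[3] };
    v4si vb = { b[0], b[1], b[2], b[3] };
    v4si m = (va == vb);
    return m[lane];
}

void demo_vector_flag_value(void)
{
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 1, 0, 3, 0 };
    assert(eq_lane_value(a, b, 0) == -1);
    assert(eq_lane_value(a, b, 1) == 0);
}
```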

[RFC PATCH] Enable vectorization for unknown tripcount in very cheap cost model but disable epilog vectorization.

2024-09-10 Thread liuhongt

GCC12 enables vectorization for O2 with very cheap cost model which is restricted to constant tripcount. The vectorization capacity is very limited w/ consideration of codesize impact. The patch extends the very cheap cost model a little bit to support variable tripcount. But still disable peel

[PATCH] Enable tune fuse_move_and_alu for GNR/GNR-D.

2024-09-10 Thread liuhongt

According to Intel Software Optimization Manual[1], the Redwood cove microarchitecture supports LD+OP and MOV+OP macro fusions. The patch enables MOV+OP tune for GNR. [1] https://www.intel.com/content/www/us/en/content-details/814198/intel-64-and-ia-32-architectures-optimization-reference-manual

[PATCH] Don't force_reg operands[3] when it's not const0_rtx.

2024-09-08 Thread liuhongt

It fix the regression by a51f2fc0d80869ab079a93cc3858f24a1fd28237 is the first bad commit commit a51f2fc0d80869ab079a93cc3858f24a1fd28237 Author: liuhongt Date: Wed Sep 4 15:39:17 2024 +0800 Handle const0_operand for *avx2_pcmp3_1. caused FAIL: gcc.target/i386/pr59539-1.c scan-assembler

[PATCH] Handle const0_operand for *avx2_pcmp3_1.

2024-09-04 Thread liuhongt

*_eq3_1 supports nonimm_or_0_operand for op1 and op2, pass_combine would fail to lower avx512 comparison back to avx2 one when op1/op2 is const0_rtx. It's because the splitter only supports nonimmediate_operand. Failed to match this instruction: (set (reg/i:V16QI 20 xmm0) (vec_merge:V16QI (con

[PATCH] [x86] Check avx upper register for parallel.

2024-08-29 Thread liuhongt

> Can the above loop be a part of ix86_check_avx_upper_register, so this > function would scan the full RTX for avx upper register? Changed, also adjust ix86_check_avx_upper_stores and ix86_avx_u128_mode_needed to either inline the old ix86_check_avx_upper_register or replace FOR_EACH_SUBRTX with

[PATCH] [x86] Check avx upper register for parallel.

2024-08-29 Thread liuhongt

For function arguments/return, when it's BLK mode, it's put in a parallel with an expr_list, and the expr_list contains the real mode and registers. Current ix86_check_avx_upper_register only checked for SSE_REG_P, and failed to handle that. The patch extend the handle to each subrtx. Bootstrapped

[PATCH v2 1/2] Enhance cse_insn to handle all-zeros and all-ones for vector mode.

2024-08-26 Thread liuhongt

> You are possibly overwriting src_related_elt - I'd suggest to either break > here or do the loop below for each found elt? Changed. > Do we know that will always succeed? 1) validate_subreg allows subreg for 2 vector modes with same component modes. 2) gen_lowpart in cse.cc is defined as gen_low

[PATCH v2 2/2] [x86] Update ix86_mode_tieable_p and ix86_rtx_costs.

2024-08-26 Thread liuhongt

For mode2 bigger than 16 bytes, when it can be allocated to FIRST_SSE_REGS, it can only be allocated to ALL_SSE_REGS, and it can be tieable to any mode1 of smaller size that is available to FIRST_SSE_REGS. When the mode size is equal to 16 bytes, exclude non-vector modes (TI/TFmode). This is need fo

[PATCH 1/2] Enhance cse_insn to handle all-zeros and all-ones for vector mode.

2024-08-26 Thread liuhongt

Also try to handle redundant broadcasts when there's already a broadcast to a bigger mode with exactly the same component value. For broadcast, component mode needs to be the same. For all-zeros/ones, only need to check the bigger mode. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,} and

[PATCH 2/2] [x86] Update ix86_mode_tieable_p and ix86_rtx_costs.

2024-08-26 Thread liuhongt

For mode2 bigger than 16 bytes, when it can be allocated to FIRST_SSE_REGS, it can only be allocated to ALL_SSE_REGS, and it can be tieable to any mode1 of smaller size that is available to FIRST_SSE_REGS. When the mode size is equal to 16 bytes, exclude non-vector modes (TI/TFmode). This is need fo

[GCC13/GCC12 PATCH] Fix testcase failure.

2024-08-21 Thread liuhongt

It looks like -mprefer-vector-width=128 doesn't impact store_max/move_max on the GCC13/GCC12 branches, so explicitly use -mmove-max=128, -mstore-max=128 for those testcases. Committed as an obvious fix. gcc/testsuite/ChangeLog: * gcc.target/i386/pieces-memcpy-10.c: Use -mmove-max=256 and -mst

[PATCH] Align ix86_{move_max,store_max} with vectorizer.

2024-08-20 Thread liuhongt

When none of mprefer-vector-width, avx256_optimal/avx128_optimal, avx256_store_by_pieces/avx512_store_by_pieces is specified, GCC will set ix86_{move_max,store_max} as max available vector length except for AVX part. if (TARGET_AVX512F_P (opts->x_ix86_isa_flags) &&

[PATCH] Align predicates for operands[1] between mov and *mov_internal.

2024-08-20 Thread liuhongt

From [1] > > It's not obvious to me why movv16qi requires a nonimmediate_operand > > source, especially since ix86_expand_vector_mode does have code to > > cope with constant operand[1]s. emit_move_insn_1 doesn't check the > > predicates anyway, so the predicate will have little effect. > > > > A

[PATCH v2] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-15 Thread liuhongt

It results in 2 failures for x86_64-pc-linux-gnu{\ -march=cascadelake}; gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1 For pr113560.c, now GCC generates mulx instead of mulq with -march=cascadelake, which should be optimal,

[PATCH] [x86] Movement between GENERAL_REGS and SSE_REGS for TImode doesn't need secondary reload.

2024-08-13 Thread liuhongt

It results in 2 failures for x86_64-pc-linux-gnu{\ -march=cascadelake}; gcc: gcc.target/i386/extendditi3-1.c scan-assembler cqt?o gcc: gcc.target/i386/pr113560.c scan-assembler-times \tmulq 1 For pr113560.c, now GCC generates mulx instead of mulq with -march=cascadelake, which should be optimal,

[PATCH] Move ix86_align_loops into a separate pass and insert the pass after pass_endbr_and_patchable_area.

2024-08-12 Thread liuhongt

> Are there any assumptions that BB_HEAD must be a note or label? > Maybe we should move ix86_align_loops into a separate pass and insert > the pass just before pass_final. The patch inserts .p2align after endbr pass, it can also fix the issue. Bootstrapped and regtested on x86_64-pc-linux-gnu{-m3

[PATCH] [x86] Mention _Float16 and __bf16 changes in GCC14.

2024-07-30 Thread liuhongt

Ok for trunk? --- htdocs/gcc-14/changes.html| 7 +++ htdocs/gcc-14/porting_to.html | 9 + 2 files changed, 16 insertions(+) diff --git a/htdocs/gcc-14/changes.html b/htdocs/gcc-14/changes.html index ca4cae0f..b023a4b9 100644 --- a/htdocs/gcc-14/changes.html +++ b/htdocs/gcc-14/ch

[PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.

2024-07-29 Thread liuhongt

(insn 98 94 387 2 (parallel [ (set (reg:TI 337 [ _32 ]) (ashift:TI (reg:TI 329) (reg:QI 521))) (clobber (reg:CC 17 flags)) ]) "test.c":11:13 953 {ashlti3_doubleword} is reloaded into (insn 98 452 387 2 (parallel [ (se

[PATCH] Fix mismatch between constraint and predicate for ashl3_doubleword.

2024-07-25 Thread liuhongt

(insn 98 94 387 2 (parallel [ (set (reg:TI 337 [ _32 ]) (ashift:TI (reg:TI 329) (reg:QI 521))) (clobber (reg:CC 17 flags)) ]) "test.c":11:13 953 {ashlti3_doubleword} is reloaded into (insn 98 452 387 2 (parallel [ (se

[PATCH] [x86]Refine constraint "Bk" to define_special_memory_constraint.

2024-07-24 Thread liuhongt

For the pattern below, RA may still allocate r162 as a v/k register and try to reload the address with leaq __libc_tsd_CTYPE_B@gottpoff(%rip), %rsi, which results in a linker error. (set (reg:DI 162) (mem/u/c:DI (const:DI (unspec:DI [(symbol_ref:DI ("a") [flags 0x60] )]

[PATCH] Relax ix86_hardreg_mov_ok after split1.

2024-07-22 Thread liuhongt

ix86_hardreg_mov_ok was added by r11-5066-gbe39636d9f68c4 >The solution proposed here is to have the x86 backend/recog prevent >early RTL passes composing instructions (that set likely_spilled hard >registers) that they (combine) can't simplify, until after reload. >We allow sets fr

[PATCH v2] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-17 Thread liuhongt

> Also, in case the insn is deleted, do: > > emit_note (NOTE_INSN_DELETED); > > DONE; > > instead of leaving (const_int 0) in the stream. > > So, the above insn preparation statements should read: > > --cut here-- > if (constm1_operand (operands[2], mode)) > emit_move_insn (operands[0], operands[

[PATCH] [x86][avx512] Optimize maskstore when mask is 0 or -1 in UNSPEC_MASKMOV

2024-07-16 Thread liuhongt

Bootstrapped and regtested on x86_64-pc-linux-gnu{-m32,}. Ready to push to trunk. gcc/ChangeLog: PR target/115843 * config/i386/predicates.md (const0_or_m1_operand): New predicate. * config/i386/sse.md (*_store_mask_1): New pre_reload define_insn_and_split.

1 - 100 of 627 matches