[PATCH v1] LoongArch: testsuite: Fix gcc.dg/vect/vect-reduc-mul_{1, 2}.c FAIL.

2024-02-01 Thread Li Wei
This FAIL was introduced by r14-6908. The reason is that when the constant vector permutation implementations were merged, the 128-bit matching case was not fully considered. In fact, the expansion of 128-bit vectors after merging only supports value-based 4-element set shuffles, so this time is a c

[PATCH v2 1/2] LoongArch: Redundant sign extension elimination optimization.

2024-01-11 Thread Li Wei
We found that the current combine optimization pass in gcc cannot handle the following redundant sign extension situations: (insn 77 76 78 5 (set (reg:SI 143) (plus:SI (subreg/s/u:SI (reg/v:DI 104 [ len ]) 0) (const_int 1 [0x1]))) {addsi3} (expr_list:REG_DEAD (reg/v:DI 104

[PATCH v2 2/2] LoongArch: Redundant sign extension elimination optimization 2.

2024-01-11 Thread Li Wei
Eliminate the redundant sign extension that exists after the conditional move when the target register is SImode. gcc/ChangeLog: * config/loongarch/loongarch.cc (loongarch_expand_conditional_move): Adjust. gcc/testsuite/ChangeLog: * gcc.target/loongarch/sign-extend-2.c:

[PATCH v1] LoongArch: Adjust cost of vector_stmt that match multiply-add pattern.

2024-01-24 Thread Li Wei
We found that when only 128-bit vectorization was enabled, 549.fotonik3d_r failed to vectorize effectively. For this reason, we adjust the cost of 128-bit vector_stmt that match the multiply-add pattern to facilitate 128-bit vectorization. The experimental results show that after the modification,
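For readers unfamiliar with the term, the following is a minimal sketch of a loop whose body has the multiply-add shape the patch description refers to. The function name `muladd` and its parameters are hypothetical and are not taken from 549.fotonik3d_r; whether a given GCC build vectorizes this loop with 128-bit vectors depends on the options and the cost model in use.

```c
#include <stddef.h>

/* Hypothetical kernel: each iteration is one multiply followed by one add,
   which is the shape a vectorizer may treat as a multiply-add statement. */
void muladd(float *restrict y, const float *restrict a,
            const float *restrict x, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a[i] * x[i] + y[i];
}
```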

[PATCH v1] LoongArch: Optimize implementation of single-precision floating-point approximate division.

2024-01-24 Thread Li Wei
We found that in the spec17 521.wrf program, some loop-invariant code generated from the single-precision floating-point approximate division calculation failed to be hoisted out of the loop. This is because the pseudo-register that stores the intermediate temporary calculation results is rewritten in the implemen

[PATCH v2] LoongArch: Adjust cost of vector_stmt that match multiply-add pattern.

2024-01-26 Thread Li Wei
We found that when only 128-bit vectorization was enabled, 549.fotonik3d_r failed to vectorize effectively. For this reason, we adjust the cost of 128-bit vector_stmt that match the multiply-add pattern to facilitate 128-bit vectorization. The experimental results show that after the modification,

[PATCH v1] LoongArch: Fixed bug in *bstrins__for_ior_mask template.

2023-12-24 Thread Li Wei
We found that a GCC built from the latest sources causes a miscompare error when running the spec2006 400.perlbench test with -flto turned on. Testing showed that only the LoongArch architecture reports the error. The first bad commit was located through the git bisect command as r14-377

[PATCH v1] LoongArch: Merge constant vector permuatation implementations.

2023-12-28 Thread Li Wei
There are currently two versions of the implementations of constant vector permutation: loongarch_expand_vec_perm_const_1 and loongarch_expand_vec_perm_const_2. The implementations of the two versions are different. Currently, only the implementation of loongarch_expand_vec_perm_const_1 is used fo

[PATCH v2] LoongArch: Merge constant vector permuatation implementations.

2023-12-28 Thread Li Wei
There are currently two versions of the implementations of constant vector permutation: loongarch_expand_vec_perm_const_1 and loongarch_expand_vec_perm_const_2. The implementations of the two versions are different. Currently, only the implementation of loongarch_expand_vec_perm_const_1 is used fo

[PATCH v1] LoongArch: Implement C[LT]Z_DEFINED_VALUE_AT_ZERO

2023-11-16 Thread Li Wei
LoongArch has defined ctz and clz in the backend, but if we want GCC to do the CTZ transformation optimization in the forwprop2 pass, GCC needs to know the value of c[lt]z at zero, which may be beneficial for some test cases (like spec2017 deepsjeng_r). After implementing the macro, we test dynamic instru
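As an illustration of why the value at zero matters, here is a sketch of a table-based count-trailing-zeros idiom of the kind that, to my understanding, GCC's forward propagation can recognize as a CTZ operation. The function name `ctz_debruijn` is hypothetical and the code is not taken from deepsjeng_r. Note that the idiom returns 0 for an input of 0, so a compiler can only replace it with a hardware count instruction if it knows, or corrects for, what that instruction produces at zero.

```c
#include <stdint.h>

/* Classic 32-bit de Bruijn lookup table for locating the lowest set bit. */
static const int debruijn_table[32] = {
    0,  1, 28,  2, 29, 14, 24,  3, 30, 22, 20, 15, 25, 17,  4,  8,
    31, 27, 13, 23, 21, 19, 16,  7, 26, 12, 18,  6, 11,  5, 10,  9
};

/* Hypothetical helper: isolate the lowest set bit with x & -x, multiply by
   the de Bruijn constant, and use the top five bits as the table index. */
int ctz_debruijn(uint32_t x)
{
    return debruijn_table[((x & -x) * 0x077CB531u) >> 27];
}
```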

[PATCH v2] LoongArch: Implement C[LT]Z_DEFINED_VALUE_AT_ZERO

2023-11-16 Thread Li Wei
LoongArch has defined ctz and clz in the backend, but if we want GCC to do the CTZ transformation optimization in the forwprop2 pass, GCC needs to know the value of c[lt]z at zero, which may be beneficial for some test cases (like spec2017 deepsjeng_r). After implementing the macro, we test dynamic instru

[PATCH v1 1/2] LoongArch: Accelerate optimization of scalar signed/unsigned popcount.

2023-11-27 Thread Li Wei
In LoongArch, the vector popcount has corresponding instructions, while the scalar does not. Currently, the scalar popcount is calculated through a loop, and the value of a non-power of two needs to be iterated several times, so the vector popcount instruction is considered for optimization. gcc/C
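To make the contrast concrete, the sketch below shows a loop-based scalar popcount next to the GCC builtin. Both function names are hypothetical. The loop form runs once per set bit, which matches the observation that a value that is not a power of two needs several iterations; how the builtin is expanded on a given target is up to the compiler and is not shown here.

```c
#include <stdint.h>

/* Loop form: clears the lowest set bit each time, so it iterates once
   per set bit in x. */
int popcount_loop(uint32_t x)
{
    int n = 0;
    while (x) {
        x &= x - 1;
        n++;
    }
    return n;
}

/* Builtin form: leaves the choice of expansion to the compiler. */
int popcount_builtin(uint32_t x)
{
    return __builtin_popcount(x);
}
```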

[PATCH v1 2/2] LoongArch: Optimize vector constant extract-{even/odd} permutation.

2023-11-27 Thread Li Wei
For vector constant extract-{even/odd} permutation, replace the default [x]vshuf instruction combination with the [x]vilv{l/h} instruction, which reduces the number of instructions and improves performance. gcc/ChangeLog: * config/loongarch/loongarch.cc (loongarch_is_odd_extraction): Supplement
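For reference, an extract-even or extract-odd permutation can be modelled in scalar C as below. This is only a sketch of the selector pattern (indices 0, 2, 4, ... or 1, 3, 5, ... into the concatenation of two inputs); the name `extract_even_odd`, the element count `NELT`, and the element type are assumptions for illustration and say nothing about how the backend matches or expands the pattern.

```c
#define NELT 4

/* Hypothetical scalar model: pick the elements at even (odd == 0) or odd
   (odd == 1) positions of the 2 * NELT element concatenation {a, b}. */
void extract_even_odd(const int a[NELT], const int b[NELT],
                      int odd, int out[NELT])
{
    for (int i = 0; i < NELT; i++) {
        int idx = 2 * i + odd;              /* selector index into {a, b} */
        out[i] = idx < NELT ? a[idx] : b[idx - NELT];
    }
}
```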

[PATCH v1] LoongArch: Remove duplicate definition of CLZ_DEFINED_VALUE_AT_ZERO.

2023-11-27 Thread Li Wei
In the r14-5547 commit, C[LT]Z_DEFINED_VALUE_AT_ZERO were defined at the same time, but in fact, CLZ_DEFINED_VALUE_AT_ZERO has already been defined, so remove the duplicate definition. gcc/ChangeLog: * config/loongarch/loongarch.h (CTZ_DEFINED_VALUE_AT_ZERO): Add description.

[PATCH v1] LoongArch: Adjust the vector cost model for better performance

2023-09-18 Thread Li Wei
gcc/ChangeLog: * config/loongarch/loongarch.cc (loongarch_builtin_vectorization_cost): --- gcc/config/loongarch/loongarch.cc | 21 ++--- 1 file changed, 14 insertions(+), 7 deletions(-) diff --git a/gcc/config/loongarch/loongarch.cc b/gcc/config/loongarch/loongarch.cc in

[PATCH v1 2/3] LoongArch: Optimize [x]vshuf insn to [x]vbitsel insn in some shuffle cases.

2025-02-05 Thread Li Wei
Currently, a shuffle that selects elements from two vectors at corresponding positions is implemented on LoongArch through the [x]vshuf instruction, but this will introduce additional index copies. In this case, the [x]vbitsel.v instruction can be used for optimization. gcc/ChangeLog: * config/loong
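The equivalence behind this optimization can be sketched in scalar C: when every output element comes from the same position in one of the two inputs, an index-driven shuffle gives the same result as a bitwise select with an all-ones or all-zeros mask per element. The function names and `NELT` below are hypothetical, and the operand order and mask polarity of the real [x]vbitsel.v instruction are deliberately not modelled.

```c
#include <stdint.h>

#define NELT 4

/* Index form: idx[i] is either i (take a[i]) or i + NELT (take b[i]). */
void blend_by_index(const uint32_t a[NELT], const uint32_t b[NELT],
                    const int idx[NELT], uint32_t out[NELT])
{
    for (int i = 0; i < NELT; i++)
        out[i] = idx[i] < NELT ? a[idx[i]] : b[idx[i] - NELT];
}

/* Mask form: mask[i] is all ones to take b[i], all zeros to take a[i]. */
void blend_by_mask(const uint32_t a[NELT], const uint32_t b[NELT],
                   const uint32_t mask[NELT], uint32_t out[NELT])
{
    for (int i = 0; i < NELT; i++)
        out[i] = (a[i] & ~mask[i]) | (b[i] & mask[i]);
}
```

The mask form needs only a constant mask rather than a full index vector, which is the kind of saving the patch description points at.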

[PATCH v1 3/3] LoongArch: General xvbitsel insn combinations optimization for special type.

2025-02-05 Thread Li Wei
In LoongArch, when the permutation idx comes from different vectors and idx is not repeated, for V8SI/V8SF/V4DI/V4DF type vectors, we can use two xvperm.w + one xvbitsel.v instructions or two xvpermi.d + one xvbitsel.v instructions for shuffle optimization. gcc/ChangeLog: * config/loongar

[PATCH v1 1/3] LoongArch: Optimize two 256-bit vectors correspond highpart and lowpart splicing shuffle.

2025-02-05 Thread Li Wei
In LoongArch, we have xvshuf.{b/h/w/d} instructions which can handle the case where all low 128-bit elements of the target vector are shuffled by concatenating the low 128-bit elements of the two input vectors, and all high 128-bit elements of the target vector are similarly shuffled. Therefore,
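The lane-wise behaviour described here can be modelled in scalar C as follows. This is a sketch under stated assumptions: a 256-bit vector of eight 32-bit elements split into two 128-bit lanes, and a selector convention (values 0 to 3 pick from the first input's lane, 4 to 7 from the second input's lane) chosen for illustration. The name `lanewise_shuffle` is hypothetical and the exact selector encoding of the real xvshuf instructions is not modelled.

```c
#define NELT 8   /* elements per 256-bit vector (32-bit elements) */
#define LANE 4   /* elements per 128-bit lane */

/* Hypothetical scalar model: each output element is chosen only from the
   same 128-bit lane of a and b, never from the other lane. Selector values
   are assumed to be non-negative. */
void lanewise_shuffle(const int a[NELT], const int b[NELT],
                      const int sel[NELT], int out[NELT])
{
    for (int i = 0; i < NELT; i++) {
        int base = (i / LANE) * LANE;   /* first element of this lane */
        int s = sel[i] % (2 * LANE);    /* index into the lane's 8 candidates */
        out[i] = s < LANE ? a[base + s] : b[base + s - LANE];
    }
}
```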

[PATCH v1 3/3] LoongArch: General xvbitsel insn combinations optimization for special type.

2025-02-21 Thread Li Wei
In LoongArch, when the permutation idx comes from different vectors and idx is not repeated, for V8SI/V8SF/V4DI/V4DF type vectors, we can use two xvperm.w + one xvbitsel.v instructions or two xvpermi.d + one xvbitsel.v instructions for shuffle optimization. gcc/ChangeLog: * config/loongar

[PATCH v1 2/3] LoongArch: Optimize [x]vshuf insn to [x]vbitsel insn in some shuffle cases.

2025-02-21 Thread Li Wei
Currently, a shuffle that selects elements from two vectors at corresponding positions is implemented on LoongArch through the [x]vshuf instruction, but this will introduce additional index copies. In this case, the [x]vbitsel.v instruction can be used for optimization. gcc/ChangeLog: * config/loong

[PATCH v2 2/3] LoongArch: Optimize [x]vshuf insn to [x]vbitsel insn in some shuffle cases.

2025-02-21 Thread Li Wei
Currently, a shuffle that selects elements from two vectors at corresponding positions is implemented on LoongArch through the [x]vshuf instruction, but this will introduce additional index copies. In this case, the [x]vbitsel.v instruction can be used for optimization. gcc/ChangeLog: * config/loong

[PATCH v2 1/3] LoongArch: Optimize two 256-bit vectors correspond highpart and lowpart splicing shuffle.

2025-02-21 Thread Li Wei
In LoongArch, we have xvshuf.{b/h/w/d} instructions which can handle the case where all low 128-bit elements of the target vector are shuffled by concatenating the low 128-bit elements of the two input vectors, and all high 128-bit elements of the target vector are similarly shuffled. Therefore,

[PATCH v2 3/3] LoongArch: General xvbitsel insn combinations optimization for special type.

2025-02-21 Thread Li Wei
In LoongArch, when the permutation idx comes from different vectors and idx is not repeated, for V8SI/V8SF/V4DI/V4DF type vectors, we can use two xvperm.w + one xvbitsel.v instructions or two xvpermi.d + one xvbitsel.v instructions for shuffle optimization. gcc/ChangeLog: * config/loongar

[PATCH v1 1/3] LoongArch: Optimize two 256-bit vectors correspond highpart and lowpart splicing shuffle.

2025-02-21 Thread Li Wei
In LoongArch, we have xvshuf.{b/h/w/d} instructions which can handle the case where all low 128-bit elements of the target vector are shuffled by concatenating the low 128-bit elements of the two input vectors, and all high 128-bit elements of the target vector are similarly shuffled. Therefore,