Pushed r15-9167: [PATCH] LoongArch: Make gen-evolution.awk compatible with FreeBSD awk

2025-04-04 Thread Xi Ruoyao
On Thu, 2025-04-03 at 10:13 +0800, Lulu Cheng wrote: > > On 2025/4/2 at 11:19 AM, Xi Ruoyao wrote: > > Avoid using gensub, which FreeBSD awk lacks; use gsub and split, > > which > > gawk, mawk, and FreeBSD awk all provide. > > > > Reported-by: mp...@vip.163.com

[PATCH] LoongArch: Make gen-evolution.awk compatible with FreeBSD awk

2025-04-01 Thread Xi Ruoyao
Avoid using gensub, which FreeBSD awk lacks; use gsub and split, which gawk, mawk, and FreeBSD awk all provide. Reported-by: mp...@vip.163.com Link: https://man.freebsd.org/cgi/man.cgi?query=awk gcc/ChangeLog: * config/loongarch/genopts/gen-evolution.awk: Avoid using gensub tha

[gcc-14 PATCH] Reuse scratch registers generated by LRA

2025-03-27 Thread Xi Ruoyao
From: Denis Chertykov Test file: udivmoddi.c problem insn: 484 Before LRA pass we have: (insn 484 483 485 72 (parallel [ (set (reg/v:SI 143 [ __q1 ]) (plus:SI (reg/v:SI 143 [ __q1 ]) (const_int -2 [0xfffffffffffffffe]))) (clobber (scrat

[PATCH] LoongArch: Add ABI names for FPR

2025-03-15 Thread Xi Ruoyao
We already allow the ABI names for GPR in inline asm clobber list, so for consistency allow the ABI names for FPR as well. Reported-by: Yao Zi gcc/ChangeLog: * config/loongarch/loongarch.h (ADDITIONAL_REGISTER_NAMES): Add fa0-fa7, ft0-ft16, and fs0-fs7. gcc/testsuite/ChangeLog:

[PATCH] LoongArch: Don't use C++17 feature [PR119238]

2025-03-12 Thread Xi Ruoyao
Structured binding is a C++17 feature but the GCC code base is in C++14. gcc/ChangeLog: PR target/119238 * config/loongarch/simd.md (dot_prod): Stop using structured binding. --- Ok for trunk? gcc/config/loongarch/simd.md | 14 -- 1 file changed, 8 insertion

[PATCH] LoongArch: Fix ICE when trying to recognize bitwise + alsl.w pair [PR119127]

2025-03-11 Thread Xi Ruoyao
When we call loongarch_reassoc_shift_bitwise for _alsl_reversesi_extend, the mask is in DImode but we are trying to operate on it in SImode, causing an ICE. To fix the issue, sign-extend the mask into the mode we want. Also specially handle the case where the mask is extended into -1 to avoid a miss-op

Re: [PATCH] LoongArch: Fix incorrect reorder of __lsx_vldx and __lasx_xvldx [PR119084]

2025-03-04 Thread Xi Ruoyao
On Wed, 2025-03-05 at 10:52 +0800, Lulu Cheng wrote: > LGTM! Pushed to trunk. The draft of gcc-14 backport is attached, I'll push it if it builds & tests fine and there's no objection. -- Xi Ruoyao School of Aerospace Science and Technology, Xidia

[PATCH 08/17] LoongArch: Implement subword atomic_fetch_{and, or, xor} with am*.w instructions

2025-03-03 Thread Xi Ruoyao
We can just shift the mask and fill the other bits with 0 (for ior/xor) or 1 (for and), and use an am*.w instruction to perform the atomic operation, instead of using a LL-SC loop. gcc/ChangeLog: * config/loongarch/sync.md (UNSPEC_COMPARE_AND_SWAP_AND): Remove. (UNSPEC_COM
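
A minimal C sketch of the idea, not the generated code (function and layout assumptions mine, little-endian assumed): a subword atomic AND becomes a word-sized atomic AND whose mask has every bit outside the target byte set to 1, which is what lets a single am*.w replace the LL-SC loop.

  #include <stdatomic.h>
  #include <stdint.h>

  static uint8_t
  subword_fetch_and (uint8_t *p, uint8_t val)
  {
    uintptr_t addr = (uintptr_t) p;
    _Atomic uint32_t *word = (_Atomic uint32_t *) (addr & ~(uintptr_t) 3);
    int shift = (addr & 3) * 8;
    /* Bits outside the target byte are 1, so the AND leaves them
       untouched; for ior/xor the fill would be 0 instead.  */
    uint32_t mask = ((uint32_t) val << shift) | ~((uint32_t) 0xff << shift);
    return (uint8_t) (atomic_fetch_and (word, mask) >> shift);
  }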

[PATCH 13/17] LoongArch: Add -m[no-]scq option

2025-03-03 Thread Xi Ruoyao
We'll use the sc.q instruction for some 16-byte atomic operations, but it was only added in the LoongArch 1.1 evolution, so we need to gate it behind an option. gcc/ChangeLog: * config/loongarch/genopts/isa-evolution.in (scq): New evolution feature. * config/loongarch/loongarch-evo

[PATCH] LoongArch: Fix incorrect reorder of __lsx_vldx and __lasx_xvldx [PR119084]

2025-03-02 Thread Xi Ruoyao
They could be incorrectly reordered with store instructions like st.b because the RTL expression does not have a memory_operand or a (mem) expression. The incorrect reorder has been observed in an openh264 LTO build. Expand them to a (mem) expression instead of an unspec to fix the issue. Then we need

[PATCH 01/17] LoongArch: (NFC) Remove atomic_optab and use amop instead

2025-03-02 Thread Xi Ruoyao
They are the same. gcc/ChangeLog: * config/loongarch/sync.md (atomic_optab): Remove. (atomic_): Change atomic_optab to amop. (atomic_fetch_): Likewise. --- gcc/config/loongarch/sync.md | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/gcc/config/lo

[PATCH 05/17] LoongArch: Don't emit overly-restrictive barrier for LL-SC loops

2025-03-01 Thread Xi Ruoyao
For LL-SC loops, if the atomic operation has succeeded, the SC instruction always implies a full barrier, so the barrier we manually inserted only needs to account for the failure memorder, not the success memorder (the barrier is skipped with "b 3f" on success anyway). Note that if we use

[PATCH 17/17] LoongArch: Implement 16-byte atomic add, sub, and, or, xor, and nand with sc.q

2025-03-01 Thread Xi Ruoyao
gcc/ChangeLog: * config/loongarch/sync.md (UNSPEC_TI_FETCH_ADD): New unspec. (UNSPEC_TI_FETCH_SUB): Likewise. (UNSPEC_TI_FETCH_AND): Likewise. (UNSPEC_TI_FETCH_XOR): Likewise. (UNSPEC_TI_FETCH_OR): Likewise. (UNSPEC_TI_FETCH_NAND_MASK_INVERTED): Like

[PATCH 11/17] LoongArch: Implement 16-byte atomic load with LSX

2025-03-01 Thread Xi Ruoyao
If the vector is naturally aligned, it cannot cross cache lines so the LSX load is guaranteed to be atomic. Thus we can use LSX to do the lock-free atomic load, instead of using a lock. gcc/ChangeLog: * config/loongarch/sync.md (atomic_loadti_lsx): New define_insn. (atomic_loadti
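
A hypothetical user-level example of what benefits (names mine): with the patch, a load like this can become a single LSX vector load instead of a call into libatomic taking a lock.

  #include <stdatomic.h>

  _Atomic __int128 counter;   /* naturally (16-byte) aligned */

  __int128
  read_counter (void)
  {
    return atomic_load_explicit (&counter, memory_order_acquire);
  }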

[PATCH 03/17] LoongArch: Don't use "+" for atomic_{load, store} "m" constraint

2025-02-28 Thread Xi Ruoyao
Atomic load does not modify the memory. Atomic store does not read the memory, thus we can use "=" instead. gcc/ChangeLog: * config/loongarch/sync.md (atomic_load): Remove "+" for the memory operand. (atomic_store): Use "=" instead of "+" for the memory operand. -

[PATCH 15/17] LoongArch: Implement 16-byte CAS with sc.q

2025-02-28 Thread Xi Ruoyao
gcc/ChangeLog: * config/loongarch/sync.md (atomic_compare_and_swapti_scq): New define_insn. (atomic_compare_and_swapti): New define_expand. --- gcc/config/loongarch/sync.md | 89 1 file changed, 89 insertions(+) diff --git a/gcc/config

[PATCH 10/17] LoongArch: Implement atomic_fetch_nand

2025-02-28 Thread Xi Ruoyao
Without atomic_fetch_nandsi and atomic_fetch_nanddi, __atomic_fetch_nand is expanded to a loop containing a CAS in the body, and CAS itself is an LL-SC loop so we have a nested loop. This is obviously not a good idea, as we really just need one LL-SC loop. As ~(atom & mask) is (~mask) | (~atom), w
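
The rewrite rests on De Morgan's law, ~(atom & mask) == ~atom | ~mask, so the NAND can be done with an OR of the inverted mask inside one LL-SC loop. A quick C check of the identity:

  #include <assert.h>
  #include <stdint.h>

  int
  main (void)
  {
    uint32_t atom = 0x12345678, mask = 0x0ff0f00f;
    assert (~(atom & mask) == (~atom | ~mask));
    return 0;
  }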

[PATCH 06/17] LoongArch: Remove unneeded "b 3f" instruction after LL-SC loops

2025-02-28 Thread Xi Ruoyao
This instruction is used to skip a redundant barrier if -mno-ld-seq-sa or the memory model requires a barrier on failure. But with -mld-seq-sa and other memory models the barrier may not exist at all, and then we should remove the "b 3f" instruction as well. The implementation uses a new operand

[PATCH 16/17] LoongArch: Implement 16-byte atomic exchange with sc.q

2025-02-28 Thread Xi Ruoyao
gcc/ChangeLog: * config/loongarch/sync.md (atomic_exchangeti_scq): New define_insn. (atomic_exchangeti): New define_expand. --- gcc/config/loongarch/sync.md | 35 +++ 1 file changed, 35 insertions(+) diff --git a/gcc/config/loongarch/sync.m

[PATCH 09/17] LoongArch: Don't expand atomic_fetch_sub_{hi, qi} to LL-SC loop if -mlam-bh

2025-02-28 Thread Xi Ruoyao
With -mlam-bh, we should negate the addend first, and use an amadd instruction. Disabling the expander makes the compiler do it correctly. gcc/ChangeLog: * config/loongarch/sync.md (atomic_fetch_sub): Disable if ISA_HAS_LAM_BH. --- gcc/config/loongarch/sync.md | 2 +- 1 file cha
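
The transformation is simply fetch_sub (p, v) == fetch_add (p, -v); a sketch of what the generic expansion then does (function name mine), letting -mlam-bh emit a single amadd.{b,h}:

  #include <stdatomic.h>

  short
  fetch_sub_via_add (_Atomic short *p, short v)
  {
    /* Negate the addend, then a plain atomic add; C11 defines the
       wrap-around for atomic arithmetic on signed types.  */
    return atomic_fetch_add (p, (short) -v);
  }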

[PATCH 14/17] LoongArch: Implement 16-byte atomic store with sc.q

2025-02-28 Thread Xi Ruoyao
When LSX is not available but sc.q is (for example on LA664 where the SIMD unit is not enabled), we can use an LL-SC loop for 16-byte atomic store. gcc/ChangeLog: * config/loongarch/loongarch.cc (loongarch_print_operand_reloc): Accept "%t" for printing the number of the 64-bit mach

[PATCH 12/17] LoongArch: Implement 16-byte atomic store with LSX

2025-02-28 Thread Xi Ruoyao
If the vector is naturally aligned, it cannot cross cache lines so the LSX store is guaranteed to be atomic. Thus we can use LSX to do the lock-free atomic store, instead of using a lock. gcc/ChangeLog: * config/loongarch/sync.md (atomic_storeti_lsx): New define_insn. (at

[PATCH 07/17] LoongArch: Remove unneeded "andi offset, addr, 3" instruction in atomic_test_and_set

2025-02-28 Thread Xi Ruoyao
On LoongArch, the sll.w and srl.w instructions only take bits [4:0] of rk (the shift amount) into account, and we've already defined SHIFT_COUNT_TRUNCATED to 1 so the compiler knows this fact; thus we don't need this instruction. gcc/ChangeLog: * config/loongarch/sync.md (atomic_test_and_set):
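
An illustrative C version of the redundancy (names mine): (addr & 3) * 8 and addr * 8 agree modulo 32, and sll.w works modulo 32, so the "andi" computing addr & 3 buys nothing.

  unsigned
  byte_bit_mask (unsigned long addr)
  {
    /* With SHIFT_COUNT_TRUNCATED the compiler knows the "& 31" is
       done by sll.w itself and emits no masking instruction.  */
    return 1u << (addr * 8 & 31);
  }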

[PATCH 02/17] LoongArch: (NFC) Remove amo and use size instead

2025-02-28 Thread Xi Ruoyao
They are the same. gcc/ChangeLog: * config/loongarch/sync.md: Use <size> instead of <amo>. (amo): Remove. --- gcc/config/loongarch/sync.md | 53 +--- 1 file changed, 25 insertions(+), 28 deletions(-) diff --git a/gcc/config/loongarch/sync.md b/gcc/config/loo

[PATCH 04/17] LoongArch: Allow using bstrins for masking the address in atomic_test_and_set

2025-02-28 Thread Xi Ruoyao
We can use bstrins for masking the address here. As people are already working on LA32R (which lacks bstrins instructions), for future-proofing we check whether (const_int -4) is an and_operand and force it into a register if not. gcc/ChangeLog: * config/loongarch/sync.md (atomic_test_a
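
The masking is just clearing the two low address bits; a small C illustration (the asm comment names the single instruction the patch allows on LA64):

  unsigned long
  align_to_word (unsigned long addr)
  {
    return addr & ~3ul;   /* bstrins.d addr, $r0, 1, 0 */
  }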

[PATCH 00/17] LoongArch: Clean up atomic operations and implement 16-byte atomic operations

2025-02-28 Thread Xi Ruoyao
The entire patch bootstrapped and regtested on loongarch64-linux-gnu with -march=la664, and I've also tried several simple 16-byte atomic operation tests locally. OK for trunk? Or maybe the cleanup is OK but the 16-byte atomic implementation still needs to be confirmed by the hardware team

[PATCH] LoongArch: Add a dedicated pattern for bitwise + alsl

2025-02-28 Thread Xi Ruoyao
We've implemented the slli + bitwise => bitwise + slli reassociation in r15-7062. I'd hoped late combine could handle slli.d + bitwise + add.d => bitwise + slli.d + add.d => bitwise + alsl.d, but it does not always work, for example a |= 0xfff; b |= 0xfff; a <<= 2; b <<= 2; a += x; b
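
The example above is cut off; a hedged reconstruction of its likely shape, two independent or/shift/add chains where each chain should end up as ori + alsl.d rather than ori + slli.d + add.d:

  long
  f (long a, long b, long x, long y)
  {
    a |= 0xfff;
    b |= 0xfff;
    a <<= 2;
    b <<= 2;
    a += x;
    b += y;
    return a ^ b;
  }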

Re: [PATCH] LoongArch: Avoid unnecessary zero-initialization using LSX for scalar popcount

2025-02-25 Thread Xi Ruoyao
On Tue, 2025-02-25 at 20:49 +0800, Lulu Cheng wrote: > > On 2025/2/22 at 3:34 PM, Xi Ruoyao wrote: > > Now for __builtin_popcountl we are getting things like > > > > vrepli.b $vr0,0 > > vinsgr2vr.d $vr0,$r4,0 > > vpcnt.d $vr0,$vr0 > >

[PATCH] LoongArch: Avoid unnecessary zero-initialization using LSX for scalar popcount

2025-02-21 Thread Xi Ruoyao
Now for __builtin_popcountl we are getting things like vrepli.b $vr0,0 vinsgr2vr.d $vr0,$r4,0 vpcnt.d $vr0,$vr0 vpickve2gr.du $r4,$vr0,0 slli.w $r4,$r4,0 jr $r1 The "vrepli.b" instruction is introduced by the init-regs pass (see PR618
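
For reference, the source producing the asm above is just:

  int
  popcount (unsigned long x)
  {
    /* The leading vrepli.b zero-fill is dead: vinsgr2vr.d writes the
       only vector element the final vpickve2gr.du reads.  */
    return __builtin_popcountl (x);
  }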

Re: [RFC] RISC-V: The optimization ignored the side effects of the rounding mode, resulting in incorrect results.

2025-02-19 Thread Xi Ruoyao
assumptions about the rounding modes in > floating-point > calculations, such as in float_extend, which may prevent CSE optimizations. > Could > this also lead to lost optimization opportunities in other areas that don't > require > this option? I'm not sure. > > I suspect that the best approach would be to define relevant > attributes (perhaps similar to -frounding-math) within specific related > patterns/built-ins > to inform optimizers we are using a rounding mode and to avoid > over-optimization. The "special pattern" is supposed to be #pragma STDC FENV_ACCESS that we've not implemented. See https://gcc.gnu.org/PR34678. -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

Ping: [PATCH] testsuite: Fix up toplevel-asm-1.c for LoongArch

2025-02-18 Thread Xi Ruoyao
On Wed, 2025-02-05 at 08:57 +0800, Xi Ruoyao wrote: > Like RISC-V, on LoongArch we don't really support %cN for SYMBOL_REFs > even with -fno-pic. > > gcc/testsuite/ChangeLog: > > * c-c++-common/toplevel-asm-1.c: Use %cc3 %cc4 instead of %c3 > %c4 on LoongArc

[PATCH] LoongArch: Use normal RTL pattern instead of UNSPEC for {x, }vsr{a, l}ri instructions

2025-02-14 Thread Xi Ruoyao
Allow (t + (1ul << imm >> 1)) >> imm to be recognized as a rounding shift operation. gcc/ChangeLog: * config/loongarch/lasx.md (UNSPEC_LASX_XVSRARI): Remove. (UNSPEC_LASX_XVSRLRI): Remove. (lasx_xvsrari_): Remove. (lasx_xvsrlri_): Remove. * config/loonga
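
Worked out: 1ul << imm >> 1 is 1ul << (imm - 1), half the rounding unit, so the expression adds 0.5 ulp before truncating. A scalar C model of the shape the vector patterns now match (function name mine):

  unsigned long
  rounding_shift_right (unsigned long t, int imm)
  {
    return (t + (1ul << imm >> 1)) >> imm;
  }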

Re: [PATCH v2 2/8] LoongArch: Allow moving TImode vectors

2025-02-14 Thread Xi Ruoyao
On Fri, 2025-02-14 at 15:46 +0800, Lulu Cheng wrote: > Hi, > > If only apply the first and second patches, the code will not compile. > > Otherwise LGTM. Fixed in v3: https://gcc.gnu.org/pipermail/gcc-patches/2025-February/675776.html -- Xi Ruoyao School of Aerospace Science

[PATCH v3 5/8] LoongArch: Simplify {lsx_,lasx_x}vmaddw description

2025-02-14 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. Also reorder two operands of the outer plus in the template, so combine will recognize {x,}vadd + {x,}vmulw{ev,od} => {x,}vmaddw{ev,od}. gcc/ChangeL

[PATCH v3 4/8] LoongArch: Simplify {lsx_, lasx_x}vh{add, sub}w description

2025-02-14 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. gcc/ChangeLog: * config/loongarch/lasx.md (UNSPEC_LASX_XVHADDW_Q_D): Remove. (UNSPEC_LASX_XVHSUBW_Q_D): Remove. (UNSPEC_LASX

[PATCH v3 6/8] LoongArch: Simplify lsx_vpick description

2025-02-14 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates instead of hard-coded const vectors. This is not suitable for LASX where lasx_xvpick has a different semantic. gcc/ChangeLog: * config/loongarch/simd.md (LVEC): New define_mode_attr. (simdfmt_as_

[PATCH v3 8/8] LoongArch: Implement [su]dot_prod* for LSX and LASX modes

2025-02-14 Thread Xi Ruoyao
Although it's just a special case of "a widening product of which the result is used for reduction," having these standard names allows recognizing the dot product pattern earlier, which may be beneficial to optimization. Also fix some test failures with the test cases: - gcc.dg/vect/vect-reduc-chai

[PATCH v3 7/8] LoongArch: Implement vec_widen_mult_{even, odd}_* for LSX and LASX modes

2025-02-14 Thread Xi Ruoyao
Since PR116142 has been fixed, now we can add the standard names so the compiler will generate better code if the result of a widening product is reduced. gcc/ChangeLog: * config/loongarch/simd.md (even_odd): New define_int_attr. (vec_widen_mult__): New define_expand. gcc/test

[PATCH v3 1/8] LoongArch: Try harder using vrepli instructions to materialize const vectors

2025-02-14 Thread Xi Ruoyao
For a = (v4si){0xdddddddd, 0xdddddddd, 0xdddddddd, 0xdddddddd} we just want vrepli.b $vr0, 0xdd but the compiler actually produces a load: la.local $r14,.LC0 vld $vr0,$r14,0 It's because we only tried vrepli.d, which wouldn't work. Try all vrepli instructions for const int vector
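
A C rendering of the example (hedged; any vector whose 16 bytes are identical behaves the same):

  typedef int v4si __attribute__ ((vector_size (16)));

  v4si
  all_dd (void)
  {
    /* Every byte is 0xdd, so a single vrepli.b $vr0,0xdd suffices.  */
    return (v4si){ (int) 0xdddddddd, (int) 0xdddddddd,
                   (int) 0xdddddddd, (int) 0xdddddddd };
  }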

[PATCH v3 2/8] LoongArch: Allow moving TImode vectors

2025-02-14 Thread Xi Ruoyao
We have some vector instructions for operations on 128-bit integer (i.e. TImode) vectors. Previously they had been modeled with unspecs, but it's more natural to just model them with TImode vector RTL expressions. For the preparation, allow moving V1TImode and V2TImode vectors in LSX and LASX reg

[PATCH v3 3/8] LoongArch: Simplify {lsx_, lasx_x}v{add, sub, mul}l{ev, od} description

2025-02-14 Thread Xi Ruoyao
These pattern definitions are tediously long, invoking 32 UNSPECs and many hard-coded long const vectors. To simplify them, we first use TImode vector operations instead of the UNSPECs, then adopt an approach from AArch64: using a special predicate to match the const vectors for odd/even i

[PATCH v3 0/8] LoongArch: SIMD odd/even/horizontal widening arithmetic cleanup and optimization

2025-02-14 Thread Xi Ruoyao
tested on loongarch64-linux-gnu, no new code change in v3. Ok for trunk? Xi Ruoyao (8): LoongArch: Try harder using vrepli instructions to materialize const vectors LoongArch: Allow moving TImode vectors LoongArch: Simplify {lsx_,lasx_x}v{add,sub,mul}l{ev,od} description LoongArch: Si

[PATCH v2 7/8] LoongArch: Implement vec_widen_mult_{even, odd}_* for LSX and LASX modes

2025-02-13 Thread Xi Ruoyao
Since PR116142 has been fixed, now we can add the standard names so the compiler will generate better code if the result of a widening product is reduced. gcc/ChangeLog: * config/loongarch/simd.md (even_odd): New define_int_attr. (vec_widen_mult__): New define_expand. gcc/test

Re: [PATCH 6/8] LoongArch: Simplify {lsx,lasx_x}vpick description

2025-02-13 Thread Xi Ruoyao
n test the optimal > values > > for -malign-{functions,labels,jumps,loops} on that basis. Thanks! -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

[PATCH v2 6/8] LoongArch: Simplify lsx_vpick description

2025-02-13 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates instead of hard-coded const vectors. This is not suitable for LASX where lasx_xvpick has a different semantic. gcc/ChangeLog: * config/loongarch/simd.md (LVEC): New define_mode_attr. (simdfmt_as_

[PATCH v2 5/8] LoongArch: Simplify {lsx_,lasx_x}vmaddw description

2025-02-13 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. Also reorder two operands of the outer plus in the template, so combine will recognize {x,}vadd + {x,}vmulw{ev,od} => {x,}vmaddw{ev,od}. gcc/ChangeL

[PATCH v2 8/8] LoongArch: Implement [su]dot_prod* for LSX and LASX modes

2025-02-13 Thread Xi Ruoyao
Although it's just a special case of "a widening product of which the result is used for reduction," having these standard names allows recognizing the dot product pattern earlier, which may be beneficial to optimization. Also fix some test failures with the test cases: - gcc.dg/vect/vect-reduc-chai

[PATCH v2 3/8] LoongArch: Simplify {lsx_, lasx_x}v{add, sub, mul}l{ev, od} description

2025-02-13 Thread Xi Ruoyao
These pattern definitions are tediously long, invoking 32 UNSPECs and many hard-coded long const vectors. To simplify them, we first use TImode vector operations instead of the UNSPECs, then adopt an approach from AArch64: using a special predicate to match the const vectors for odd/even i

[PATCH v2 2/8] LoongArch: Allow moving TImode vectors

2025-02-13 Thread Xi Ruoyao
We have some vector instructions for operations on 128-bit integer (i.e. TImode) vectors. Previously they had been modeled with unspecs, but it's more natural to just model them with TImode vector RTL expressions. For the preparation, allow moving V1TImode and V2TImode vectors in LSX and LASX reg

[PATCH v2 1/8] LoongArch: Try harder using vrepli instructions to materialize const vectors

2025-02-13 Thread Xi Ruoyao
For a = (v4si){0xdddddddd, 0xdddddddd, 0xdddddddd, 0xdddddddd} we just want vrepli.b $vr0, 0xdd but the compiler actually produces a load: la.local $r14,.LC0 vld $vr0,$r14,0 It's because we only tried vrepli.d, which wouldn't work. Try all vrepli instructions for const int vector

[PATCH v2 4/8] LoongArch: Simplify {lsx_, lasx_x}vh{add, sub}w description

2025-02-13 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. gcc/ChangeLog: * config/loongarch/lasx.md (UNSPEC_LASX_XVHADDW_Q_D): Remove. (UNSPEC_LASX_XVHSUBW_Q_D): Remove. (UNSPEC_LASX

[PATCH v2 0/8] LoongArch: SIMD odd/even/horizontal widening arithmetic cleanup and optimization

2025-02-13 Thread Xi Ruoyao
is selected for the left operand of addsub. Swap the operands if needed when outputting the asm. - Fix typos in commit subjects. - Mention V2TI in loongarch-modes.def. Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk? Xi Ruoyao (8): LoongArch: Try harder using vrepli instructions

Re: [PATCH v2 3/4] LoongArch: After setting the compilation options, update the predefined macros.

2025-02-12 Thread Xi Ruoyao
On Thu, 2025-02-13 at 09:24 +0800, Lulu Cheng wrote: > > On 2025/2/12 at 6:19 PM, Xi Ruoyao wrote: > > On Wed, 2025-02-12 at 18:03 +0800, Lulu Cheng wrote: > > > > /* snip */ > > > > > diff --git a/gcc/testsuite/gcc.target/loongarch/pr118828-3.c > > > b

Re: [PATCH v2 3/4] LoongArch: After setting the compilation options, update the predefined macros.

2025-02-12 Thread Xi Ruoyao
oongarch/pr118828-4.c > @@ -0,0 +1,55 @@ > +/* { dg-do run } */ > +/* { dg-options "-mtune=la464" } */ > + > +#include > +#include > +#include > + > +#ifndef __loongarch_tune > +#error __loongarch_tune should not be available here Likewise. -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

Re: [PATCH 6/8] LoongArch: Simplify {lsx,lasx_x}vpick description

2025-02-11 Thread Xi Ruoyao
On Tue, 2025-02-11 at 16:52 +0800, Lulu Cheng wrote: > > On 2025/2/7 at 8:09 PM, Xi Ruoyao wrote: > /* snip */ > > - > > -(define_insn "lasx_xvpickev_w" > > -  [(set (match_operand:V8SI 0 "register_operand" "=f") > > - (vec_select:V8S

Re: [PATCH 2/3] LoongArch: Split the function loongarch_cpu_cpp_builtins into two functions.

2025-02-11 Thread Xi Ruoyao
_LSX) > -    { > -  builtin_define ("__loongarch_simd"); > -  builtin_define ("__loongarch_sx"); > - > -  if (!ISA_HAS_LASX) > - builtin_define ("__loongarch_simd_width=128"); > -    } > - > -  if (ISA_HAS_LASX) > -    { >

Re: [PATCH 5/8] LoongArch: Simplify {lsx_,lasx_x}maddw description

2025-02-11 Thread Xi Ruoyao
On Tue, 2025-02-11 at 15:49 +0800, Lulu Cheng wrote: > It seems that the title here is "{lsx_,lasx_x}vmaddw". Will fix in v2. -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

Re: [PATCH 4/8] LoongArch: Simplify {lsx_,lasx_x}hv{add,sub}w description

2025-02-11 Thread Xi Ruoyao
On Tue, 2025-02-11 at 15:48 +0800, Lulu Cheng wrote: > Hi, > >   I think the "{lsx_,lasx_x}hv{add,sub}w" in the title should be > "{lsx_,lasx_x}vh{add,sub}w". Indeed. > > On 2025/2/7 at 8:09 PM, Xi Ruoyao wrote: > > Like what we've done for {ls

[PATCH] LoongArch: Accept ADD, IOR or XOR when combining objects with no bits in common [PR115478]

2025-02-10 Thread Xi Ruoyao
Since r15-1120, multi-word shifts/rotates produce PLUS instead of IOR. It's generally a good thing (allowing the use of our alsl instruction or similar instructions on other architectures), but it's preventing us from using bytepick. For example, if we shift a __int128 by 16 bits, the higher word can
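
The underlying fact: when two operands share no set bits, PLUS, IOR, and XOR all compute the same value, so treating them interchangeably loses nothing and lets bytepick match. A quick C check:

  #include <assert.h>

  int
  main (void)
  {
    unsigned long hi = 0xabcd000000000000ul, lo = 0x1234ul;
    assert ((hi | lo) == hi + lo && (hi | lo) == (hi ^ lo));
    return 0;
  }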

[PATCH 7/8] LoongArch: Implement vec_widen_mult_{even, odd}_* for LSX and LASX modes

2025-02-07 Thread Xi Ruoyao
Since PR116142 has been fixed, now we can add the standard names so the compiler will generate better code if the result of a widening product is reduced. gcc/ChangeLog: * config/loongarch/simd.md (even_odd): New define_int_attr. (vec_widen_mult__): New define_expand. gcc/test

[PATCH 5/8] LoongArch: Simplify {lsx_,lasx_x}maddw description

2025-02-07 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. Also reorder two operands of the outer plus in the template, so combine will recognize {x,}vadd + {x,}vmulw{ev,od} => {x,}vmaddw{ev,od}. gcc/ChangeL

[PATCH 3/8] LoongArch: Simplify {lsx_, lasx_x}v{add, sub, mul}l{ev, od} description

2025-02-07 Thread Xi Ruoyao
These pattern definitions are tediously long, invoking 32 UNSPECs and many hard-coded long const vectors. To simplify them, we first use TImode vector operations instead of the UNSPECs, then adopt an approach from AArch64: using a special predicate to match the const vectors for odd/even i

[PATCH 8/8] LoongArch: Implement [su]dot_prod* for LSX and LASX modes

2025-02-07 Thread Xi Ruoyao
Although it's just a special case of "a widening product of which the result is used for reduction," having these standard names allows recognizing the dot product pattern earlier, which may be beneficial to optimization. Also fix some test failures with the test cases: - gcc.dg/vect/vect-reduc-chai

[PATCH 6/8] LoongArch: Simplify {lsx,lasx_x}vpick description

2025-02-07 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates instead of hard-coded const vectors. gcc/ChangeLog: * config/loongarch/lasx.md (lasx_xvpickev_b): Remove. (lasx_xvpickev_h): Remove. (lasx_xvpickev_w): Remove. (lasx_xvpickev_w_f):

[PATCH 4/8] LoongArch: Simplify {lsx_, lasx_x}hv{add, sub}w description

2025-02-07 Thread Xi Ruoyao
Like what we've done for {lsx_,lasx_x}v{add,sub,mul}l{ev,od}, use special predicates and TImode RTL instead of hard-coded const vectors and UNSPECs. gcc/ChangeLog: * config/loongarch/lasx.md (UNSPEC_LASX_XVHADDW_Q_D): Remove. (UNSPEC_LASX_XVHSUBW_Q_D): Remove. (UNSPEC_LASX

[PATCH 1/8] LoongArch: Try harder using vrepli instructions to materialize const vectors

2025-02-07 Thread Xi Ruoyao
For a = (v4si){0xdddddddd, 0xdddddddd, 0xdddddddd, 0xdddddddd} we just want vrepli.b $vr0, 0xdd but the compiler actually produces a load: la.local $r14,.LC0 vld $vr0,$r14,0 It's because we only tried vrepli.d, which wouldn't work. Try all vrepli instructions for const int vector

[PATCH 2/8] LoongArch: Allow moving TImode vectors

2025-02-07 Thread Xi Ruoyao
We have some vector instructions for operations on 128-bit integer (i.e. TImode) vectors. Previously they had been modeled with unspecs, but it's more natural to just model them with TImode vector RTL expressions. For the preparation, allow moving V1TImode and V2TImode vectors in LSX and LASX reg

[PATCH 0/8] LoongArch: SIMD odd/even/horizontal widening arithmetic cleanup and optimization

2025-02-07 Thread Xi Ruoyao
. Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk? Xi Ruoyao (8): LoongArch: Try harder using vrepli instructions to materialize const vectors LoongArch: Allow moving TImode vectors LoongArch: Simplify {lsx_,lasx_x}v{add,sub,mul}l{ev,od} description LoongArch: Simplify

[PATCH] testsuite: LoongArch: Remove from btrunc, ceil, and floor effective target allowlist

2025-02-07 Thread Xi Ruoyao
Now that the C default is C23, we can no longer use LSX/LASX instructions for these operations as the standard disallows raising INEXACT exceptions. So LoongArch is no longer suitable for these effective targets. Fix the test failures on gcc.dg/vect/vect-rounding-*.c. For the old standards or -ff

Re: [PATCH v1 2/3] LoongArch: Optimize [x]vshuf insn to [x]vbitsel insn in some shuffle cases.

2025-02-05 Thread Xi Ruoyao
quot; "")]) - (define_insn "lsx_vbitseli_b" [(set (match_operand:V16QI 0 "register_operand" "=f") (ior:V16QI (and:V16QI (not:V16QI diff --git a/gcc/config/loongarch/simd.md b/gcc/config/loongarch/simd.md index 611d1f87dd2..49070e829ca 100644 --- a/gcc/config/loongarch/simd.md +++ b/gcc/config/loongarch/simd.md @@ -950,6 +950,19 @@ (define_expand "_maddw_q_du_d_punned" DONE; }) +(define_insn "@simd_vbitsel" + [(set (match_operand:ALLVEC 0 "register_operand" "=f") + (ior:ALLVEC + (and:ALLVEC + (not:ALLVEC (match_operand:ALLVEC 3 "register_operand" "f")) + (match_operand:ALLVEC 1 "register_operand" "f")) + (and:ALLVEC (match_dup 3) + (match_operand:ALLVEC 2 "register_operand" "f"] + "" + "vbitsel.v\t%0,%1,%2,%3" + [(set_attr "type" "simd_bitmov") + (set_attr "mode" "")]) + ; The LoongArch SX Instructions. (include "lsx.md") -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

[PATCH] testsuite: Fix up toplevel-asm-1.c for LoongArch

2025-02-04 Thread Xi Ruoyao
Like RISC-V, on LoongArch we don't really support %cN for SYMBOL_REFs even with -fno-pic. gcc/testsuite/ChangeLog: * c-c++-common/toplevel-asm-1.c: Use %cc3 %cc4 instead of %c3 %c4 on LoongArch. --- Ok for trunk? gcc/testsuite/c-c++-common/toplevel-asm-1.c | 2 +- 1 file change

[PATCH] vect: Fix wrong code with pr108692.c on targets with only non-widening ABD [PR118727]

2025-02-04 Thread Xi Ruoyao
With things like // signed char a_14, a_16; a.0_4 = (unsigned char) a_14; _5 = (int) a.0_4; b.1_6 = (unsigned char) b_16; _7 = (int) b.1_6; c_17 = _5 - _7; _8 = ABS_EXPR <c_17>; r_18 = _8 + r_23; An ABD pattern will be recognized for _8: patt_31 = .ABD (a.0_4, b.1_6); It's still cor
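
A guess at the C source behind that gimple (hypothetical test case, matching the casts shown above): signed chars compared as unsigned chars with the absolute difference accumulated, which is what gets recognized as .ABD:

  int
  sad (const signed char *a, const signed char *b, int n)
  {
    int r = 0;
    for (int i = 0; i < n; i++)
      r += __builtin_abs ((unsigned char) a[i] - (unsigned char) b[i]);
    return r;
  }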

Re: [PATCH] LoongArch: Fix invalid subregs in xorsign [PR118501]

2025-01-22 Thread Xi Ruoyao
On Thu, 2025-01-23 at 11:21 +0800, Lulu Cheng wrote: > > 在 2025/1/22 下午9:26, Xi Ruoyao 写道: > > The test case added in r15-7073 now triggers an ICE, indicating we need > > the same fix as AArch64. > > > > gcc/ChangeLog: > > > > PR target/1185

[PATCH 4/5] LoongArch: Don't emit overly-restrictive barrier for LL-SC loops

2025-01-22 Thread Xi Ruoyao
For LL-SC loops, if the atomic operation has succeeded, the SC instruction always implies a full barrier, so the barrier we manually inserted only needs to account for the failure memorder, not the success memorder (the barrier is skipped with "b 3f" on success anyway). Note that if we use

[PATCH 3/5] LoongArch: Allow using bstrins for masking the address in atomic_test_and_set

2025-01-22 Thread Xi Ruoyao
We can use bstrins for masking the address here. As people are already working on LA32R (which lacks bstrins instructions), for future-proofing we check whether (const_int -4) is an and_operand and force it into a register if not. gcc/ChangeLog: * config/loongarch/sync.md (atomic_test_a

[PATCH 0/5] LoongArch: Atomic operation clean-up and micro-optimization

2025-01-22 Thread Xi Ruoyao
Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk? Xi Ruoyao (5): LoongArch: (NFC) Remove atomic_optab and use amop instead LoongArch: Don't use "+" for atomic_{load,store} "m" constraint LoongArch: Allow using bstrins for masking the address i

[PATCH 5/5] LoongArch: Remove "b 3f" instruction if unneeded

2025-01-22 Thread Xi Ruoyao
This instruction is used to skip a redundant barrier if -mno-ld-seq-sa or the memory model requires a barrier on failure. But with -mld-seq-sa and other memory models the barrier may not exist at all, and then we should remove the "b 3f" instruction as well. The implementation uses a new operand

[PATCH 1/5] LoongArch: (NFC) Remove atomic_optab and use amop instead

2025-01-22 Thread Xi Ruoyao
They are the same. gcc/ChangeLog: * config/loongarch/sync.md (atomic_optab): Remove. (atomic_): Change atomic_optab to amop. (atomic_fetch_): Likewise. --- gcc/config/loongarch/sync.md | 6 ++ 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/gcc/config/lo

[PATCH 2/5] LoongArch: Don't use "+" for atomic_{load, store} "m" constraint

2025-01-22 Thread Xi Ruoyao
Atomic load does not modify the memory. Atomic store does not read the memory, thus we can use "=" instead. gcc/ChangeLog: * config/loongarch/sync.md (atomic_load): Remove "+" for the memory operand. (atomic_store): Use "=" instead of "+" for the memory operand. -

[PATCH] LoongArch: Fix invalid subregs in xorsign [PR118501]

2025-01-22 Thread Xi Ruoyao
The test case added in r15-7073 now triggers an ICE, indicating we need the same fix as AArch64. gcc/ChangeLog: PR target/118501 * config/loongarch/loongarch.md (@xorsign<mode>3): Use force_lowpart_subreg. --- Bootstrapped and regtested on loongarch64-linux-gnu, ok for trunk?

Re: [PATCH 1/2] LoongArch: Fix wrong code with _alsl_reversesi_extended

2025-01-22 Thread Xi Ruoyao
On Wed, 2025-01-22 at 10:53 +0800, Xi Ruoyao wrote: > On Wed, 2025-01-22 at 10:37 +0800, Lulu Cheng wrote: > > > > On 2025/1/22 at 8:49 AM, Xi Ruoyao wrote: > > > The second source register of this insn cannot be the same as the > > > destination reg

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
C, 31, 0 alsl.d $a0, $a0, $a1, 2 .endr addi.d $t1, $t1, -1 bnez $t1, 1b ret -DAVOID_RD_EQUAL_RS actually makes the program slower (on LA464)... So I don't think it's really beneficial to deliberately insert a move. -- Xi Ruoyao School of Aerospace Science and Technology, Xidian University

Re: [PATCH 1/2] LoongArch: Fix wrong code with _alsl_reversesi_extended

2025-01-21 Thread Xi Ruoyao
On Wed, 2025-01-22 at 10:37 +0800, Lulu Cheng wrote: > > On 2025/1/22 at 8:49 AM, Xi Ruoyao wrote: > > The second source register of this insn cannot be the same as the > > destination register. > > > > gcc/ChangeLog: > > > > * config/loongarch/loongarch.

[PATCH 2/2] LoongArch: Partially fix code regression from r15-7062

2025-01-21 Thread Xi Ruoyao
The uarch can fuse bstrpick.d rd,rs1,31,0 and alsl.d rd,rd,rs2,shamt, so for this special case we should use alsl.d instead of slli.d. I'd hoped late combine would handle slli.d + and + add.d => and + slli.d + add.d => and + alsl.d, but it does not always work (even before the alsl.d special case

[PATCH 1/2] LoongArch: Fix wrong code with _alsl_reversesi_extended

2025-01-21 Thread Xi Ruoyao
The second source register of this insn cannot be the same as the destination register. gcc/ChangeLog: * config/loongarch/loongarch.md (_alsl_reversesi_extended): Add '&' to the destination register constraint and append '0' to the first source register constraint

[PATCH 0/2] LoongArch: Bitwise and shift reassoc fixes

2025-01-21 Thread Xi Ruoyao
the uarch macro-fused operation. The fix is partial because TARGET_SCHED_MACRO_FUSION_PAIR_P will be needed to guarantee the bstrpick.d and alsl instructions are not separated. Bootstrapped and regtested on loongarch64-linux-gnu. Ok for trunk? Xi Ruoyao (2): LoongArch: Fix wrong code with

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 23:18 +0800, Xi Ruoyao wrote: > On Tue, 2025-01-21 at 22:14 +0800, Xi Ruoyao wrote: > > > in GCC 13 the result is: > > > > > >   or $r12,$r4,$r0 > > > > > > Hmm, this strange move is caused by "&"

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 22:14 +0800, Xi Ruoyao wrote: > > > in GCC 13 the result is: > > > > > >   or $r12,$r4,$r0 > > > > Hmm, this strange move is caused by "&" in bstrpick_alsl_paired.  Is it > > really needed for the fusion? >

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 22:14 +0800, Xi Ruoyao wrote: > On Tue, 2025-01-21 at 21:52 +0800, Xi Ruoyao wrote: > > > struct Pair { unsigned long a, b; }; > > > > > > struct Pair > > > test (struct Pair p, long x, long y) > > > { > > >

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 21:52 +0800, Xi Ruoyao wrote: > > struct Pair { unsigned long a, b; }; > > > > struct Pair > > test (struct Pair p, long x, long y) > > { > >   p.a &= 0xffffffff; > >   p.a <<= 2; > >   p.a += x; > >   p.

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 21:23 +0800, Xi Ruoyao wrote: /* snip */ > > It seems to be more formal through TARGET_SCHED_MACRO_FUSION_P and > > > > TARGET_SCHED_MACRO_FUSION_PAIR_P. I found the spec test item that > > generated > > > > this instruction pair. I i

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 20:34 +0800, Lulu Cheng wrote: > > On 2025/1/21 at 6:05 PM, Xi Ruoyao wrote: > > On Tue, 2025-01-21 at 16:41 +0800, Lulu Cheng wrote: > > > On 2025/1/21 at 12:59 PM, Xi Ruoyao wrote: > > > > On Tue, 2025-01-21 at 11:46 +0800, Lulu Cheng wrote: > >

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-21 Thread Xi Ruoyao
On Tue, 2025-01-21 at 16:41 +0800, Lulu Cheng wrote: > > On 2025/1/21 at 12:59 PM, Xi Ruoyao wrote: > > On Tue, 2025-01-21 at 11:46 +0800, Lulu Cheng wrote: > > > On 2025/1/18 at 7:33 PM, Xi Ruoyao wrote: > > > /* snip */ > > > >    ;; This code iterator

Re: [PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-20 Thread Xi Ruoyao
On Tue, 2025-01-21 at 11:46 +0800, Lulu Cheng wrote: > > On 2025/1/18 at 7:33 PM, Xi Ruoyao wrote: > /* snip */ > >   ;; This code iterator allows unsigned and signed division to be generated > >   ;; from the same template. > > @@ -3083,39 +3084

[PATCH] LoongArch: Correct the mode for mask{eq,ne}z

2025-01-19 Thread Xi Ruoyao
For mask{eq,ne}z, rk is always compared with 0 in the full width, thus the mode for rk should be X. I found the issue reviewing a patch fixing a similar issue for RISC-V XTheadCondMov [1], but interestingly I cannot find a test case really blowing up on LoongArch. But as the issue is obvious enou
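
The semantics in C terms make the mode requirement clear (function name mine): the rk == 0 test reads the whole 64-bit register no matter how wide the selected data is.

  long
  maskeqz (long rj, long rk)
  {
    return rk == 0 ? 0 : rj;   /* maskeqz rd, rj, rk */
  }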

Re: [PATCH] RISC-V: Correct the mode that is causing the program to fail for XTheadCondMov

2025-01-19 Thread Xi Ruoyao
e/gcc.target/riscv/xtheadcondmov-bug.c > @@ -0,0 +1,12 @@ > +/* { dg-do compile  { target { rv64 } } } */ > +/* { dg-options "-march=rv64gc_xtheadcondmov -mabi=lp64d -O2" } */ > + > +__attribute__((noinline, noclone)) long long int The attributes are useless as nothing is

[PATCH v2 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-18 Thread Xi Ruoyao
For things like (x | 0x101) << 11 It's obvious to write: ori $r4,$r4,257 slli.d $r4,$r4,11 But we are actually generating something insane: lu12i.w $r12,524288>>12 # 0x80000 ori $r12,$r12,2048 slli.d $r4,$r4,11 or
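
For context, the function behind the example presumably looks like this (assumed shape, matching the quoted expressions):

  long
  f (long x)
  {
    /* Wanted: ori $r4,$r4,257 then slli.d $r4,$r4,11.  */
    return (x | 0x101) << 11;
  }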

[PATCH v2 1/2] LoongArch: Simplify using bstr{ins, pick} instructions for and

2025-01-18 Thread Xi Ruoyao
For bstrins, we can merge it into and<mode>3 instead of having a separate define_insn. For bstrpick, we can use the constraints to ensure the first source register and the destination register are the same hardware register, instead of emitting a move manually. This will simplify the next commit where

Re: [PATCH 2/2] LoongArch: Improve reassociation for bitwise operation and left shift [PR 115921]

2025-01-16 Thread Xi Ruoyao
On January 15, 2025, at 6:12:34 PM GMT+08:00, Xi Ruoyao wrote: >For things like > >(x | 0x101) << 11 > >It's obvious to write: > >ori $r4,$r4,257 >slli.d $r4,$r4,11 > >But we are actually generating something insane: >

Re: [PATCH] LoongArch: Fix cost model for alsl

2025-01-16 Thread Xi Ruoyao
On Thu, 2025-01-16 at 20:52 +0800, Xi Ruoyao wrote: > On Thu, 2025-01-16 at 20:30 +0800, Lulu Cheng wrote: > > > > On 2025/1/15 at 6:10 PM, Xi Ruoyao wrote: > > > diff --git a/gcc/config/loongarch/loongarch.cc > > > b/gcc/config/loongarch/loongarch.cc > >
