On 2023/12/13 at 2:27 AM, Xi Ruoyao wrote:
On Tue, 2023-12-12 at 20:39 +0800, Xi Ruoyao wrote:
fld.s $f1,$r4,0
fld.s $f0,$r4,4
fld.s $f3,$r4,8
fld.s $f2,$r4,12
fcmp.slt.s $fcc1,$f0,$f3
fcmp.sgt.s $fcc0,$f1,$f2
movcf2gr $r13,$fcc1
movcf2gr $r12,$fcc0
There is also a problem: on the 3A5000, MOVCF2GR takes 7 cycles, whereas
MOVCF2FR+MOVFR2GR is a cycle. The 3A6000 has no such problem.
or $r12,$r12,$r13
bnez $r12,.L3
fld.s $f4,$r4,16
fld.s $f5,$r4,20
or $r4,$r0,$r0
fcmp.sgt.s $fcc1,$f1,$f5
fcmp.slt.s $fcc0,$f0,$f4
movcf2gr $r12,$fcc1
movcf2gr $r13,$fcc0
or $r12,$r12,$r13
bnez $r12,.L2
fcmp.sgt.s $fcc1,$f3,$f5
fcmp.slt.s $fcc0,$f2,$f4
movcf2gr $r4,$fcc1
movcf2gr $r12,$fcc0
or $r4,$r4,$r12
xori $r4,$r4,1
slli.w $r4,$r4,0
jr $r1
.align 4
.L3:
or $r4,$r0,$r0
.align 4
.L2:
jr $r1
Per my micro-benchmark this is much faster than
LOGICAL_OP_NON_SHORT_CIRCUIT = 0 for randomly generated inputs (i.e.
when the branches are not predictable).
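To make the comparison concrete, here is a rough C sketch of the two
evaluation strategies. The original source function is not shown in this
mail, so the function names and the comparisons below are hypothetical,
read back from the assembly above. The sketch emulates by hand, with `|`
versus `||`, what LOGICAL_OP_NON_SHORT_CIRCUIT decides inside the compiler.
It is not the actual test case.

```c
/* Hypothetical reconstruction from the assembly listing; the names are
   made up for illustration. */

/* Non-short-circuit form: both comparisons of each pair are always
   evaluated and combined with a bitwise OR, so each pair costs one
   branch (as in the listing above). */
int check_nsc(const float *a)
{
    if ((a[1] < a[2]) | (a[0] > a[3]))
        return 0;
    if ((a[0] > a[5]) | (a[1] < a[4]))
        return 0;
    return !((a[2] > a[5]) | (a[3] < a[4]));
}

/* Short-circuit form: each comparison may get its own conditional
   branch, which hurts when the inputs make the branches unpredictable. */
int check_sc(const float *a)
{
    if (a[1] < a[2] || a[0] > a[3])
        return 0;
    if (a[0] > a[5] || a[1] < a[4])
        return 0;
    return !(a[2] > a[5] || a[3] < a[4]);
}
```

Both functions return the same value for every input; only the number of
conditional branches the compiler is likely to emit differs.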
Note that there is a redundant slli.w instruction in the compiled code
and I couldn't find a way to remove it (my trick in the TARGET_64BIT
branch only works for simple examples). We may be able to handle it via
the ext_dce pass [1] in the future.
[1]: https://gcc.gnu.org/pipermail/gcc-patches/2023-November/637320.html
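As a small illustration of why the slli.w is redundant here: on
LoongArch64, "slli.w rd, rj, 0" sign-extends the low 32 bits of rj into
the 64-bit rd. The C sketch below models that with a hypothetical helper,
sext32 (not a GCC or LoongArch API), and relies on GCC's
implementation-defined modulo behaviour for the narrowing conversion. The
xori result is always 0 or 1, so the extension cannot change it.

```c
#include <stdint.h>

/* Hypothetical model of "slli.w rd, rj, 0": keep the low 32 bits of the
   register and sign-extend them to 64 bits. */
static int64_t sext32(uint64_t reg)
{
    return (int64_t)(int32_t)(uint32_t)reg;
}
```

For the only two values the function can produce, sext32(0) is 0 and
sext32(1) is 1, so dropping the instruction would not change the result.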