> On 11 Oct 2024, at 16:43, Jennifer Schmitz <jschm...@nvidia.com> wrote:
>
>> On 11 Oct 2024, at 12:08, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>
>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>> -----Original Message-----
>>>> From: Richard Biener <rguent...@suse.de>
>>>> Sent: Friday, October 11, 2024 7:52 AM
>>>> To: Richard Sandiford <richard.sandif...@arm.com>
>>>> Cc: Jennifer Schmitz <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org;
>>>> Richard Earnshaw <richard.earns...@arm.com>;
>>>> Kyrylo Tkachov <ktkac...@nvidia.com>; Tamar Christina <tamar.christ...@arm.com>
>>>> Subject: Re: [PATCH][PR113816] AArch64: Use SIMD+GPR for logical vector reductions
>>>>
>>>>> On Thu, 10 Oct 2024, Richard Sandiford wrote:
>>>>>
>>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>>> This patch implements the optabs reduc_and_scal_<mode>,
>>>>>>> reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for ASIMD modes V8QI,
>>>>>>> V16QI, V4HI, and V8HI for TARGET_SIMD to improve codegen for bitwise
>>>>>>> logical vector reduction operations.
>>>>>>> Previously, either only vector registers or only general purpose
>>>>>>> registers (GPR) were used.  Now, vector registers are used for the
>>>>>>> reduction from 128 to 64 bits; 64-bit GPR are used for the reduction
>>>>>>> from 64 to 32 bits; and 32-bit GPR are used for the remaining
>>>>>>> reduction steps.
>>>>>>>
>>>>>>> For example, the test case (V8HI)
>>>>>>>
>>>>>>> int16_t foo (int16_t *a)
>>>>>>> {
>>>>>>>   int16_t b = -1;
>>>>>>>   for (int i = 0; i < 8; ++i)
>>>>>>>     b &= a[i];
>>>>>>>   return b;
>>>>>>> }
>>>>>>>
>>>>>>> was previously compiled to (-O2):
>>>>>>>
>>>>>>> foo:
>>>>>>>         ldr     q0, [x0]
>>>>>>>         movi    v30.4s, 0
>>>>>>>         ext     v29.16b, v0.16b, v30.16b, #8
>>>>>>>         and     v29.16b, v29.16b, v0.16b
>>>>>>>         ext     v31.16b, v29.16b, v30.16b, #4
>>>>>>>         and     v31.16b, v31.16b, v29.16b
>>>>>>>         ext     v30.16b, v31.16b, v30.16b, #2
>>>>>>>         and     v30.16b, v30.16b, v31.16b
>>>>>>>         umov    w0, v30.h[0]
>>>>>>>         ret
>>>>>>>
>>>>>>> With the patch, it is compiled to:
>>>>>>>
>>>>>>> foo:
>>>>>>>         ldr     q31, [x0]
>>>>>>>         ext     v30.16b, v31.16b, v31.16b, #8
>>>>>>>         and     v31.8b, v30.8b, v31.8b
>>>>>>>         fmov    x0, d31
>>>>>>>         and     x0, x0, x0, lsr 32
>>>>>>>         and     w0, w0, w0, lsr 16
>>>>>>>         ret
>>>>>>>
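>>>>>>> In scalar terms, the GPR tail of the new sequence just folds the value
>>>>>>> in half twice.  A sketch of the equivalent C (for illustration only,
>>>>>>> not part of the patch):
>>>>>>>
>>>>>>>   #include <stdint.h>
>>>>>>>
>>>>>>>   /* After the EXT+AND step, the low 64 bits hold four 16-bit lanes,
>>>>>>>      each already ANDed with its counterpart from the high half.  */
>>>>>>>   static inline int16_t fold64_and (uint64_t x)
>>>>>>>   {
>>>>>>>     x &= x >> 32;                  /* and x0, x0, x0, lsr 32 */
>>>>>>>     uint32_t w = (uint32_t) x;
>>>>>>>     w &= w >> 16;                  /* and w0, w0, w0, lsr 16 */
>>>>>>>     return (int16_t) w;
>>>>>>>   }
>>>>>>>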
>>>>>>> For modes V4SI and V2DI, the pattern was not implemented, because the
>>>>>>> current codegen (using only base instructions) is already efficient.
>>>>>>>
>>>>>>> Note that the PR initially suggested using SVE reduction ops.  However,
>>>>>>> they have higher latency than the proposed sequence, which is why using
>>>>>>> Neon and base instructions is preferable.
>>>>>>>
>>>>>>> Test cases were added for 8/16-bit integers for all implemented modes
>>>>>>> and all three operations to check the produced assembly.
>>>>>>>
>>>>>>> We also added [istarget aarch64*-*-*] to the selector vect_logical_reduc,
>>>>>>> because for aarch64 vector types, either the logical reduction optabs are
>>>>>>> implemented or the codegen for reduction operations is already good.
>>>>>>> This was motivated by the failure of a scan-tree-dump directive in the
>>>>>>> test cases gcc.dg/vect/vect-reduc-or_1.c and gcc.dg/vect/vect-reduc-or_2.c.
>>>>>>>
>>>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no
>>>>>>> regression.
>>>>>>> OK for mainline?
>>>>>>>
>>>>>>> Signed-off-by: Jennifer Schmitz <jschm...@nvidia.com>
>>>>>>>
>>>>>>> gcc/
>>>>>>> PR target/113816
>>>>>>> * config/aarch64/aarch64-simd.md (reduc_<optab>_scal_<mode>):
>>>>>>> Implement for logical bitwise operations for VDQV_E.
>>>>>>>
>>>>>>> gcc/testsuite/
>>>>>>> PR target/113816
>>>>>>> * lib/target-supports.exp (vect_logical_reduc): Add aarch64*.
>>>>>>> * gcc.target/aarch64/simd/logical_reduc.c: New test.
>>>>>>> * gcc.target/aarch64/vect-reduc-or_1.c: Adjust expected outcome.
>>>>>>> ---
>>>>>>> gcc/config/aarch64/aarch64-simd.md | 55 +++++
>>>>>>> .../gcc.target/aarch64/simd/logical_reduc.c | 208 ++++++++++++++++++
>>>>>>> .../gcc.target/aarch64/vect-reduc-or_1.c | 2 +-
>>>>>>> gcc/testsuite/lib/target-supports.exp | 4 +-
>>>>>>> 4 files changed, 267 insertions(+), 2 deletions(-)
>>>>>>> create mode 100644 gcc/testsuite/gcc.target/aarch64/simd/logical_reduc.c
>>>>>>>
>>>>>>> diff --git a/gcc/config/aarch64/aarch64-simd.md b/gcc/config/aarch64/aarch64-simd.md
>>>>>>> index 23c03a96371..00286b8b020 100644
>>>>>>> --- a/gcc/config/aarch64/aarch64-simd.md
>>>>>>> +++ b/gcc/config/aarch64/aarch64-simd.md
>>>>>>> @@ -3608,6 +3608,61 @@
>>>>>>> }
>>>>>>> )
>>>>>>>
>>>>>>> +;; Emit a sequence for bitwise logical reductions over vectors for
>>>>>>> +;; V8QI, V16QI, V4HI, and V8HI modes.  The reduction is achieved by
>>>>>>> +;; iteratively operating on the two halves of the input.
>>>>>>> +;; If the input has 128 bits, the first operation is performed in
>>>>>>> +;; vector registers.  From 64 bits down, the reduction steps are
>>>>>>> +;; performed in general purpose registers.
>>>>>>> +;; For example, for V8HI and operation AND, the intended sequence is:
>>>>>>> +;;   EXT  v1.16b, v0.16b, v0.16b, #8
>>>>>>> +;;   AND  v0.8b, v1.8b, v0.8b
>>>>>>> +;;   FMOV x0, d0
>>>>>>> +;;   AND  x0, x0, x0, lsr 32
>>>>>>> +;;   AND  w0, w0, w0, lsr 16
>>>>>>> +;;
>>>>>>> +;; For V8QI and operation AND, the sequence is:
>>>>>>> +;;   AND  x0, x0, x0, lsr 32
>>>>>>> +;;   AND  w0, w0, w0, lsr 16
>>>>>>> +;;   AND  w0, w0, w0, lsr 8
>>>>>>> +
>>>>>>> +(define_expand "reduc_<optab>_scal_<mode>"
>>>>>>> + [(match_operand:<VEL> 0 "register_operand")
>>>>>>> +  (LOGICAL:VDQV_E (match_operand:VDQV_E 1 "register_operand"))]
>>>>>>> + "TARGET_SIMD"
>>>>>>> + {
>>>>>>> +   rtx dst = operands[1];
>>>>>>> +   rtx tdi = gen_reg_rtx (DImode);
>>>>>>> +   rtx tsi = lowpart_subreg (SImode, tdi, DImode);
>>>>>>> +   rtx op1_lo;
>>>>>>> +   if (known_eq (GET_MODE_SIZE (<MODE>mode), 16))
>>>>>>> +     {
>>>>>>> +       rtx t0 = gen_reg_rtx (<MODE>mode);
>>>>>>> +       rtx t1 = gen_reg_rtx (DImode);
>>>>>>> +       rtx t2 = gen_reg_rtx (DImode);
>>>>>>> +       rtx idx = GEN_INT (8 / GET_MODE_UNIT_SIZE (<MODE>mode));
>>>>>>> +       emit_insn (gen_aarch64_ext<mode> (t0, dst, dst, idx));
>>>>>>> +       op1_lo = lowpart_subreg (V2DImode, dst, <MODE>mode);
>>>>>>> +       rtx t0_lo = lowpart_subreg (V2DImode, t0, <MODE>mode);
>>>>>>> +       emit_insn (gen_aarch64_get_lanev2di (t1, op1_lo, GEN_INT (0)));
>>>>>>> +       emit_insn (gen_aarch64_get_lanev2di (t2, t0_lo, GEN_INT (0)));
>>>>>>> +       emit_insn (gen_<optab>di3 (t1, t1, t2));
>>>>>>> +       emit_move_insn (tdi, t1);
>>>>>>> +     }
>>>>>>> +   else
>>>>>>> +     {
>>>>>>> +       op1_lo = lowpart_subreg (DImode, dst, <MODE>mode);
>>>>>>> +       emit_move_insn (tdi, op1_lo);
>>>>>>> +     }
>>>>>>> +   emit_insn (gen_<optab>_lshrdi3 (tdi, tdi, GEN_INT (32), tdi));
>>>>>>> +   emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (16), tsi));
>>>>>>> +   if (known_eq (GET_MODE_UNIT_BITSIZE (<MODE>mode), 8))
>>>>>>> +     emit_insn (gen_<optab>_lshrsi3 (tsi, tsi, GEN_INT (8), tsi));
>>>>>>> +   emit_move_insn (operands[0], lowpart_subreg (<VEL>mode, tsi, SImode));
>>>>>>> +   DONE;
>>>>>>> + }
>>>>>>> +)
>>>>>>
>>>>>> Sorry to be awkward, but I've got mixed feelings about this.
>>>>>> The sequence produced looks good. But we're not defining this
>>>>>> optab because the target has a special instruction for implementing
>>>>>> the operation. We're just implementing it to tweak the loop emitted
>>>>>> by vect_create_epilog_for_reduction. Maybe we should change that loop
>>>>>> instead, perhaps with a target hook?
>>>>>
>>>>> So does it currently not work because there are no v8qi, v4qi or v2qi
>>>>> vector modes?  I'll note there already _is_
>>>>> targetm.vectorize.split_reduction (mode), but it expects a vector
>>>>> mode as the result and it isn't passed the operation.
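>>>>>
>>>>> (For reference, the current hook is just a mode-to-mode mapping; the
>>>>> default implementation is essentially the identity, roughly:
>>>>>
>>>>>   /* targetm.vectorize.split_reduction: given the mode of a vector
>>>>>      reduction, return the vector mode to split it to.  */
>>>>>   machine_mode
>>>>>   default_split_reduction (machine_mode mode)
>>>>>   {
>>>>>     return mode;
>>>>>   }
>>>>>
>>>>> so today it can neither return a scalar mode nor key off the
>>>>> reduction operation.)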
>>>>>
>>>
>>> The point of these reductions is that they shouldn't be done on the
>>> vector side.  So even if we had v4qi and v2qi, the reduction shouldn't
>>> be done there, as the latency of these sub-64-bit operations is much
>>> higher (and the throughput worse) on the vector side; they should be
>>> done on the integer side instead, i.e. on DI and SI.
>>>
>>> This is why I disagree with Richard that this can be done generically
>>> for all targets.  The decomposition takes advantage of Arm's fused
>>> shifted logical operations on GPRs, which make these steps really cheap
>>> to do on the integer side.
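>>>
>>> A minimal illustration (hypothetical helper, but this is the pattern
>>> the sequence relies on):
>>>
>>>   #include <stdint.h>
>>>
>>>   /* On AArch64 this is a single instruction, "and x0, x0, x1, lsr 32":
>>>      the shift is folded into the logical operation for free.  */
>>>   uint64_t fold_step (uint64_t x, uint64_t y)
>>>   {
>>>     return x & (y >> 32);
>>>   }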
>>
>> Wouldn't it still be a win without that though? Having the fused operation
>> gives us a ~4x improvement in latency on a typical core (EXT + vector AND
>> -> fused scalar AND). But the unfused operation would still be a ~2x
>> improvement.
>>
>>> Richard brought up AArch32 as another example.  I don't agree there
>>> either, because AArch32's register file overlap means you don't need
>>> shifts: the first reduction step can just be done by accessing the
>>> registers as even/odd pairs in a smaller mode.
>>>
>>> To get this codegen generically, the target must support modes that are
>>> tie-able between the vector size and the smaller vector size, i.e. a
>>> BIT_FIELD_REF from the vector mode to a smaller mode must be cheap or
>>> free.
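>>>
>>> In GNU C vector-extension terms (illustrative types and names), the
>>> requirement is roughly that the following lowers to a plain subreg,
>>> i.e. no shuffle and no cross-register-file move:
>>>
>>>   #include <stdint.h>
>>>
>>>   typedef uint16_t v8hi __attribute__ ((vector_size (16)));
>>>   typedef uint16_t v4hi __attribute__ ((vector_size (8)));
>>>
>>>   /* Roughly BIT_FIELD_REF <v, 64, 0>: take the low half of a 128-bit
>>>      vector.  Cheap only if the low 64 bits of the vector register are
>>>      directly addressable in the smaller mode.  */
>>>   v4hi low_half (v8hi v)
>>>   {
>>>     return __builtin_shufflevector (v, v, 0, 1, 2, 3);
>>>   }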
>>>
>>> The decision of when to do the transfer from SIMD to GPRs is highly
>>> target-specific, and so is the how.
>>>
>>>> I would expect that we can improve on this existing machinery
>>>> instead of inventing new machinery.
>>>>
>>>
>>> But we're not.  We're just implementing the optab that hooks into the
>>> machinery that already exists.
>>
>> I don't think this kind of expansion is what optabs are there for though.
>> I suppose it was in the pre-gimple days, when all optimisation was done
>> on RTL. (Although of course that also predates vectorisation.) But now,
>> I think the purpose of optabs is to advertise what the target can do via
>> some form of direct h/w support.
>>
>> The original patch (using ANDV) was an ideal use of optabs. But I think
>> the request to use a tweaked version of the current open-coded reduction
>> scheme, although a good suggestion, also changes the way that we should
>> go about it.
>>
>>> I think we're just over-engineering a new contributor's patch.  Even if
>>> we think there is some merit in doing it generically, I don't see why
>>> we shouldn't take this patch today.
>>
>> Sorry if it comes across that way.
> Hi everyone,
> Thanks for the discussion around my patch.  I will certainly be happy to
> look at alternative implementation locations for the optimization,
> including at the gimple level.
> Could you offer some more guidance on which direction to pursue: whether
> to extend targetm.vectorize.split_reduction (mode) or to make changes to
> vect_create_epilog_for_reduction?  I'm not familiar enough with these
> parts of the code yet to decide which one is more suitable.
You need to touch both, making the hook suitable and producing scalar modes.
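
Purely as a sketch (hypothetical names and signature, not an existing
interface), the hook could additionally be passed the reduction code and
be allowed to return a scalar mode:

  /* Hypothetical variant of targetm.vectorize.split_reduction for
     aarch64: for bitwise reductions, ask the vectorizer to continue
     with scalar DImode folding once the vector is down to 64 bits.  */
  machine_mode
  aarch64_split_reduction (machine_mode mode, tree_code code)
  {
    if (code == BIT_AND_EXPR || code == BIT_IOR_EXPR || code == BIT_XOR_EXPR)
      return DImode;
    return mode;
  }

vect_create_epilog_for_reduction would then need to handle a scalar
result mode by emitting shifts plus logical ops instead of vector
extracts.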
Richard
> Thanks and have a nice weekend,
> Jennifer
>>
>> Thanks,
>> Richard