> On 2 Oct 2024, at 14:34, Tamar Christina <tamar.christ...@arm.com> wrote:
> 
>> -----Original Message-----
>> From: Kyrylo Tkachov <ktkac...@nvidia.com>
>> Sent: Wednesday, October 2, 2024 1:09 PM
>> To: Richard Sandiford <richard.sandif...@arm.com>
>> Cc: Tamar Christina <tamar.christ...@arm.com>; Jennifer Schmitz
>> <jschm...@nvidia.com>; gcc-patches@gcc.gnu.org; Kyrylo Tkachov
>> <ktkac...@exchange.nvidia.com>
>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for vector
>> reductions
>> 
>> 
>> 
>>> On 2 Oct 2024, at 13:43, Richard Sandiford <richard.sandif...@arm.com> wrote:
>>> 
>>> Tamar Christina <tamar.christ...@arm.com> writes:
>>>> Hi Jennifer,
>>>> 
>>>>> -----Original Message-----
>>>>> From: Richard Sandiford <richard.sandif...@arm.com>
>>>>> Sent: Tuesday, October 1, 2024 12:20 PM
>>>>> To: Jennifer Schmitz <jschm...@nvidia.com>
>>>>> Cc: gcc-patches@gcc.gnu.org; Kyrylo Tkachov <ktkac...@exchange.nvidia.com>
>>>>> Subject: Re: [PATCH] [PR113816] AArch64: Use SVE bit op reduction for
>>>>> vector reductions
>>>>> 
>>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>>> This patch implements the optabs reduc_and_scal_<mode>,
>>>>>> reduc_ior_scal_<mode>, and reduc_xor_scal_<mode> for Advanced SIMD
>>>>>> integers for TARGET_SVE in order to use the SVE instructions ANDV,
>>>>>> ORV, and EORV for fixed-width bitwise reductions.
>>>>>> For example, the test case
>>>>>> 
>>>>>> int32_t foo (int32_t *a)
>>>>>> {
>>>>>>   int32_t b = -1;
>>>>>>   for (int i = 0; i < 4; ++i)
>>>>>>     b &= a[i];
>>>>>>   return b;
>>>>>> }
>>>>>> 
>>>>>> was previously compiled to
>>>>>> (-O2 -ftree-vectorize --param aarch64-autovec-preference=asimd-only):
>>>>>> foo:
>>>>>>       ldp     w2, w1, [x0]
>>>>>>       ldp     w3, w0, [x0, 8]
>>>>>>       and     w1, w1, w3
>>>>>>       and     w0, w0, w2
>>>>>>       and     w0, w1, w0
>>>>>>       ret
>>>>>> 
>>>>>> With the patch, it is compiled to:
>>>>>> foo:
>>>>>>       ldr     q31, [x0]
>>>>>>       ptrue   p7.b, all
>>>>>>       andv    s31, p7, z31.s
>>>>>>       fmov    w0, s31
>>>>>>       ret
>>>>>> 
>>>>>> Test cases were added to check the produced assembly for use of SVE
>>>>>> instructions.
>>>>> 
>>>>> I would imagine that in this particular case, the scalar version is
>>>>> better.  But I agree it's a useful feature for other cases.
>>>>> 
>>>> 
>>>> Yeah, I'm concerned because ANDV and other reductions are extremely
>>>> expensive.  But assuming the reductions are done outside of a loop,
>>>> it should be ok.
>>>> 
>>>> The issue is that the reduction latency grows with VL, so e.g. compare
>>>> the latencies and throughput for Neoverse V1 and Neoverse V2.  So I think
>>>> we want to gate this on VL128.
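
As a rough sketch of such a gate, and not something from the patch itself: the
expander quoted further down could in principle be restricted to targets where
the SVE vector length is known to be exactly 128 bits.  The
known_eq/BYTES_PER_SVE_VECTOR spelling below is my assumption about how that
check might be written, not a tested condition:

(define_expand "reduc_<optab>_scal_<mode>"
  [(set (match_operand:<VEL> 0 "register_operand")
        (unspec:<VEL> [(match_dup 2)
                       (match_operand:VQ_I 1 "register_operand")]
                      SVE_INT_REDUCTION_LOGICAL))]
  ;; Only when the SVE vector length is known to be 128 bits.
  "TARGET_SVE && known_eq (BYTES_PER_SVE_VECTOR, 16)"
  {
    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
  }
)

With VL known to be 128 bits, a ptrue also governs exactly the 128-bit
Advanced SIMD payload, so the correctness concern discussed just below would
not arise in that configuration.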
>>>> 
>>>> As an aside, is the sequence correct?  With an ORR reduction ptrue makes
>>>> sense, but for VL > 128 ptrue doesn't work, as the top bits would be zero.
>>>> So an ANDV over those zeroed lanes would result in zero.
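
To make that failure mode concrete, here is a small scalar model (plain C for
illustration only, not GCC code): on a 256-bit SVE implementation the V4SI
input occupies only the low 128 bits of the Z register, so a ptrue-governed
ANDV also folds in the zeroed upper lanes.

#include <stdint.h>
#include <stdio.h>

/* Model a ptrue-governed AND reduction over a 256-bit vector whose low
   128 bits hold the real V4SI data and whose upper lanes are zero.  */
static int32_t
andv_all_lanes (const int32_t *data)
{
  int32_t lanes[8] = { data[0], data[1], data[2], data[3], 0, 0, 0, 0 };
  int32_t acc = -1;
  for (int i = 0; i < 8; i++)   /* ptrue: every lane participates.  */
    acc &= lanes[i];
  return acc;                   /* Always 0 because of the zeroed upper lanes.  */
}

int
main (void)
{
  int32_t a[4] = { -1, -1, -1, -1 };
  printf ("%d\n", andv_all_lanes (a));  /* Prints 0; the correct result is -1.  */
  return 0;
}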
>>> 
>>> Argh!  Thanks for spotting that.  I'm kicking myself for missing it :(
>>> 
>>>> You'd want to predicate the ANDV with the size of the vector being
>>>> reduced.  The same is true for SMIN and SMAX.
>>>> 
>>>> I do wonder whether we need to split the pattern into two, where w->w
>>>> uses the SVE instructions but w->r uses Adv SIMD.
>>>> 
>>>> In the case of w->r, as in the example above, the sequence
>>>> 
>>>>       ext     v1.16b, v0.16b, v0.16b, #8
>>>>       and     v0.8b, v0.8b, v1.8b
>>>>       fmov    x8, d0
>>>>       lsr     x9, x8, #32
>>>>       and     w0, w8, w9
>>>> 
>>>> would beat the ANDV on pretty much every uarch.
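
For reference, a rough ACLE intrinsics model of that w->r idea (only an
approximation of the sequence above; the function name is made up and this is
not what GCC currently emits):

#include <arm_neon.h>
#include <stdint.h>

/* Fold 128 bits to 64 in the vector unit (ext + and), move the result to a
   GPR (fmov), then finish the 64 -> 32 bit step in scalar code (lsr + and).  */
static int32_t
and_reduce_v4si (int32x4_t v)
{
  uint64x2_t q = vreinterpretq_u64_s32 (v);
  uint64x1_t folded = vand_u64 (vget_low_u64 (q), vget_high_u64 (q));
  uint64_t x = vget_lane_u64 (folded, 0);
  return (int32_t) (x & (x >> 32));
}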
>>>> 
>>>> But I'll leave it up to the maintainers.
>>> 
>>> Also a good point.  And since these are integer reductions, an r
>>> result is more probable than a w result.  w would typically only
>>> be used if the result is stored directly to memory.
>>> 
>>> At which point, the question (which you might have been implying)
>>> is whether it's worth doing this at all, given the limited cases
>>> for which it's beneficial, and the complication that's needed to
>>> (a) detect those cases and (b) make them work.
>> 
>> These are good points in the thread. Maybe it makes sense to do this only for
>> V16QI reductions?
>> Maybe a variant of Tamar’s w->r sequence wins out even there.
> 
> I do agree that they're worth implementing, and also for 64-bit vectors
> (there you skip the first reduction and just fmov the value to a GPR, since
> you don't have the initial 128 -> 64 bit reduction step).
> 
> But I think at the moment they're possibly not modelled as reductions in our
> cost model.  Like Richard mentioned, I don't think the low-iteration cases
> should vectorize; they should instead just unroll.
> 
>> 
>> Originally I had hoped that we’d tackle the straight-line case from
>> PR113816, but it seems that GCC didn’t even try to create a reduction op
>> for the code there.  Maybe that’s something to look into separately.
> 
> Yeah, I think unrolled scalar is going to beat the ORV there, as you can
> have better throughput doing the reductions in pairs.
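
As an illustration of the "reductions in pairs" point (a sketch only;
sior_tree and the memcpy-based loads are made up for this example, and this is
not the code GCC produces today):

#include <stdint.h>
#include <string.h>

/* OR-reduce 16 bytes with two 64-bit loads and a log2-depth tree of scalar
   ORs, instead of a linear chain of 15 dependent byte ORs.  */
static uint8_t
sior_tree (const uint8_t *a)
{
  uint64_t lo, hi;
  memcpy (&lo, a, 8);           /* Typically a single ldp on AArch64.  */
  memcpy (&hi, a + 8, 8);
  uint64_t x = lo | hi;         /* Pairwise combine: 16 -> 8 bytes.  */
  x |= x >> 32;                 /* 8 -> 4 */
  x |= x >> 16;                 /* 4 -> 2 */
  x |= x >> 8;                  /* 2 -> 1 */
  return (uint8_t) x;
}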
> 
>> 
>> Also, for the alternative test case that we tried to use for a motivation:
>> char sior_loop (char *a)
>> {
>>   char b = 0;
>>   for (int i = 0; i < 16; ++i)
>>     b |= a[i];
>>   return b;
>> }
>> 
>> GCC generates some terrible code: https://godbolt.org/z/a68jodKca
>> So it feels that we should do something to improve the bitwise reductions for
>> AArch64 regardless.
>> Maybe we need to agree on the optimal sequences for the various modes and
>> implement them.
> 
> Yeah, I do think they're worth having!  The vectorizer's fallback method of
> doing these reductions without the optab is element-wise, but on the SIMD
> side.  I think the optimal sequences for all of them are a mix of SIMD + GPR.
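
As one concrete data point, a possible SIMD + GPR mix for a 16-byte OR
reduction might look roughly like this (a sketch only; ior_reduce_v16qi is a
made-up name and the sequence has not been benchmarked against the
alternatives):

#include <arm_neon.h>
#include <stdint.h>

/* Fold 128 -> 64 bits by ORing the two 64-bit lanes in a GPR, then finish
   the remaining folds with scalar shifts and ORs.  */
static uint8_t
ior_reduce_v16qi (uint8x16_t v)
{
  uint64x2_t q = vreinterpretq_u64_u8 (v);
  uint64_t x = vgetq_lane_u64 (q, 0) | vgetq_lane_u64 (q, 1);
  x |= x >> 32;
  x |= x >> 16;
  x |= x >> 8;
  return (uint8_t) x;
}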
> 
I revised the patch and resubmitted it at
https://gcc.gnu.org/pipermail/gcc-patches/2024-October/664981.html.
Best,
Jennifer
> Cheers,
> Tamar
> 
>> 
>> Thanks,
>> Kyrill
>> 
>>> 
>>> Thanks,
>>> Richard
>>> 
>>>> 
>>>> Thanks for the patch though!
>>>> 
>>>> Cheers,
>>>> Tamar
>>>> 
>>>>> However, a natural follow-on would be to use SMAXV and UMAXV for V2DI.
>>>>> Taking that into account:
>>>>> 
>>>>>> [...]
>>>>>> diff --git a/gcc/config/aarch64/aarch64-sve.md b/gcc/config/aarch64/aarch64-sve.md
>>>>>> index bfa28849adf..0d9e5cebef0 100644
>>>>>> --- a/gcc/config/aarch64/aarch64-sve.md
>>>>>> +++ b/gcc/config/aarch64/aarch64-sve.md
>>>>>> @@ -8927,6 +8927,28 @@
>>>>>>  "<su>addv\t%d0, %1, %2.<Vetype>"
>>>>>> )
>>>>>> 
>>>>>> +;; Unpredicated logical integer reductions for Advanced SIMD modes.
>>>>>> +(define_expand "reduc_<optab>_scal_<mode>"
>>>>>> +  [(set (match_operand:<VEL> 0 "register_operand")
>>>>>> +  (unspec:<VEL> [(match_dup 2)
>>>>>> +                 (match_operand:VQ_I 1 "register_operand")]
>>>>>> +                SVE_INT_REDUCTION_LOGICAL))]
>>>>>> +  "TARGET_SVE"
>>>>>> +  {
>>>>>> +    operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>> +  }
>>>>>> +)
>>>>>> +
>>>>>> +;; Predicated logical integer reductions for Advanced SIMD modes.
>>>>>> +(define_insn "*aarch64_pred_reduc_<optab>_<mode>"
>>>>>> +  [(set (match_operand:<VEL> 0 "register_operand" "=w")
>>>>>> +  (unspec:<VEL> [(match_operand:<VPRED> 1 "register_operand" "Upl")
>>>>>> +                 (match_operand:VQ_I 2 "register_operand" "w")]
>>>>>> +                SVE_INT_REDUCTION_LOGICAL))]
>>>>>> +  "TARGET_SVE"
>>>>>> +  "<sve_int_op>\t%<Vetype>0, %1, %Z2.<Vetype>"
>>>>>> +)
>>>>>> +
>>>>> 
>>>>> ...I think we should avoid adding more patterns, and instead extend
>>>>> the existing:
>>>>> 
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(set (match_operand:<VEL> 0 "register_operand")
>>>>>         (unspec:<VEL> [(match_dup 2)
>>>>>                        (match_operand:SVE_FULL_I 1 "register_operand")]
>>>>>                       SVE_INT_REDUCTION))]
>>>>>   "TARGET_SVE"
>>>>>   {
>>>>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>   }
>>>>> )
>>>>> 
>>>>> to all Advanced SIMD integer modes except V1DI.  This would involve
>>>>> adding a new mode iterator along the same lines as SVE_FULL_SDI_SIMD
>>>>> (SVE_FULL_I_SIMD?).
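
For illustration, such an iterator might look something like the following;
the name and exact mode list here are guesses modelled on the description
above rather than a tested definition:

;; All fully-packed SVE integer vector modes, plus the Advanced SIMD
;; integer modes other than V1DI.
(define_mode_iterator SVE_FULL_I_SIMD
  [VNx16QI VNx8HI VNx4SI VNx2DI
   V8QI V16QI V4HI V8HI V2SI V4SI V2DI])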
>>>>> 
>>>>> Then we could change the name of:
>>>>> 
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(match_operand:<VEL> 0 "register_operand")
>>>>>    (unspec:VDQ_BHSI [(match_operand:VDQ_BHSI 1 "register_operand")]
>>>>>                     MAXMINV)]
>>>>>   "TARGET_SIMD"
>>>>>   {
>>>>>     rtx elt = aarch64_endian_lane_rtx (<MODE>mode, 0);
>>>>>     rtx scratch = gen_reg_rtx (<MODE>mode);
>>>>>     emit_insn (gen_aarch64_reduc_<optab>_internal<mode> (scratch,
>>>>>                                                          operands[1]));
>>>>>     emit_insn (gen_aarch64_get_lane<mode> (operands[0], scratch, elt));
>>>>>     DONE;
>>>>>   }
>>>>> )
>>>>> 
>>>>> to something like:
>>>>> 
>>>>> (define_expand "@aarch64_advsimd_<optab><mode>"
>>>>> 
>>>>> The expander above would then be something like:
>>>>> 
>>>>> (define_expand "reduc_<optab>_scal_<mode>"
>>>>>   [(set (match_operand:<VEL> 0 "register_operand")
>>>>>         (unspec:<VEL> [(match_dup 2)
>>>>>                        (match_operand:SVE_FULL_I_SIMD 1 "register_operand")]
>>>>>                       SVE_INT_REDUCTION))]
>>>>>   "TARGET_SVE
>>>>>    || (maybe_code_for_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode)
>>>>>        != CODE_FOR_nothing)"
>>>>>   {
>>>>>     if (rtx pat = maybe_gen_aarch64_advsimd (<SVE_INT_REDUCTION>, <MODE>mode))
>>>>>       {
>>>>>         emit_insn (pat);
>>>>>         DONE;
>>>>>       }
>>>>>     operands[2] = aarch64_ptrue_reg (<VPRED>mode);
>>>>>   }
>>>>> )
>>>>> 
>>>>> which acts as a kind of union pattern for SVE and Advanced SIMD.
>>>>> 
>>>>> This would handle the SMAXV and UMAXV cases as well as the ANDV, ORV
>>>>> and EORV cases.
>>>>> 
>>>>> Not tested, so let me know if it doesn't work :)
>>>>> 
>>>>> Thanks,
>>>>> Richard

