Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>> On 5 Aug 2024, at 18:00, Richard Sandiford <richard.sandif...@arm.com> wrote:
>> 
>> Kyrylo Tkachov <ktkac...@nvidia.com> writes:
>>>> On 5 Aug 2024, at 12:01, Richard Sandiford <richard.sandif...@arm.com> 
>>>> wrote:
>>>> 
>>>> Jennifer Schmitz <jschm...@nvidia.com> writes:
>>>>> This patch folds the SVE intrinsic svdiv into a vector of 1's in case
>>>>> 1) the predicate is svptrue and
>>>>> 2) dividend and divisor are equal.
>>>>> This is implemented in the gimple_folder for signed and unsigned
>>>>> integers. Corresponding test cases were added to the existing test
>>>>> suites.
>>>>> 
>>>>> The patch was bootstrapped and regtested on aarch64-linux-gnu, no 
>>>>> regression.
>>>>> OK for mainline?
>>>>> 
>>>>> Please also advise whether it makes sense to implement the same 
>>>>> optimization
>>>>> for float types and if so, under which conditions?
>>>> 
>>>> I think we should instead use const_binop to try to fold the division
>>>> whenever the predicate is all-true, or if the function uses _x predication.
>>>> (As a follow-on, we could handle _z and _m too, using VEC_COND_EXPR.)
>>>> 
>>> 
>>> From what I can see const_binop only works on constant arguments.
>> 
>> Yeah, it only produces a result for constant arguments.  I see now
>> that that isn't the case that the patch is interested in, sorry.
>> 
>>> Is fold_binary a better interface to use? I think it’d hook into the 
>>> match.pd machinery for divisions at some point.
>> 
>> We shouldn't use that from gimple folders AIUI, but perhaps I misremember.
>> (I realise we'd be using it only to test whether the result is constant,
>> but even so.)
>> 
>> Have you (plural) come across a case where svdiv is used with equal
>> non-constant arguments?  If it's just being done on first principles
>> then how about starting with const_binop instead?  If possible, it'd be
>> good to structure it so that we can reuse the code for svadd, svmul,
>> svsub, etc.
>
> We’ve had a bit of internal discussion on this to get our ducks in a row.
> We are interested in having more powerful folding of SVE intrinsics generally 
> and we’d like some advice on how best to approach this.
> Prathamesh suggested adding code to fold intrinsics to standard GIMPLE codes 
> where possible when they are _x-predicated or have a ptrue predicate. 
> Hopefully that would allow us to get all the match.pd and fold-const.cc 
> optimizations “for free”.
> Would that be a reasonable direction rather than adding custom folding code 
> to individual intrinsics such as svdiv?
> We’d need to ensure that the midend knows how to expand such GIMPLE codes 
> with VLA types and that the required folding rules exist in match.pd (though 
> maybe they work already for VLA types?)

Expansion shouldn't be a problem, since we already rely on that for
autovectorisation.

But I think this comes back to what we discussed earlier, in the context
of whether we should replace divisions by constants with multi-instruction
alternatives.  My comment there was:

  I'm a bit uneasy about going that far.  I suppose it comes down to a
  question about what intrinsics are for.  Are they for describing an
  algorithm, or for hand-optimising a specific implementation of the
  algorithm?  IMO it's the latter.

  If people want to write out a calculation in natural arithmetic, it
  would be better to write the algorithm in scalar code and let the
  vectoriser handle it.  That gives the opportunity for many more
  optimisations than just this one.

  Intrinsics are about giving programmers direct, architecture-level
  control over how something is implemented.  I've seen Arm's library
  teams go to great lengths to work out which out of a choice of
  instruction sequences is the best one, even though the sequences in
  question would look functionally equivalent to a smart-enough compiler.

  So part of the work of using intrinsics is to figure out what the best
  sequence is.  And IMO, part of the contract is that the compiler
  shouldn't interfere with the programmer's choices too much.  If the
  compiler makes a change, it must be very confident that it is a win for
  the function as a whole.

  Replacing one division with one shift is fine, as an aid to the programmer.
  It removes the need for (say) templated functions to check for that case
  manually.  Constant folding is fine too, for similar reasons.  In these
  cases, there's not really a cost/benefit choice to be made between
  different expansions.  One choice is objectively better in all
  realistic situations.

  But when it comes to general constants, there are many different choices
  that could be made when deciding which constants should be open-coded
  and which shouldn't.  IMO we should leave the choice to the programmer
  in those cases.  If the compiler gets it wrong, there will be no way
  for the programmer to force the compiler's hand ("no, when I say svdiv,
  I really do mean svdiv").

If we just replace svmul and svdiv with MULT_EXPR and *DIV_EXPR,
we'd be discarding the user's instruction choices and imposing our own.

FWIW, Tejas is looking at adding support for C/C++ operators on VLA
vectors (__ARM_FEATURE_SVE_VECTOR_OPERATORS).  That would then give
the user the choice of writing the arithmetic "naturally" or using
intrinsics.  The former is better for users who want the compiler to choose
the instructions, while the latter is betterfor users who want to control
the implementation themselves.

Thanks,
Richard
