Re: [PATCH 2/3] AArch64: Implement target hooks for dispatch scheduling.

Tamar Christina Fri, 29 Aug 2025 03:43:20 -0700

The 08/20/2025 16:52, Jennifer Schmitz wrote:
> 
> 
> > On 5 Aug 2025, at 09:25, Tamar Christina <[email protected]> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > Hi Jennifer,
> >
> > This part is new to me so I'm just gonna go by the documentation and code 
> > for now,
> > so I'll have some questions.
> >
> > I'll also comment on things related to all three patches here.
> Hi Tamar,
> Thanks you for the helpful feedback!
> I revised the patches to improve the representation of the dispatch 
> constraints and am attaching the updated version all in this mail (I repeated 
> regression-testing and performance evaluation).
> 
> The two main changes are:
> - Instead of representing the dispatch constraints as an array of 
> dispatch_constraint objects, they are now represented in the following form: 
> The number of maximum and free slots are two integer arrays that the 
> dispatch_window manages. To avoid traversing all dispatch constraints 
> multiple times when adding an insn to a window, I added a callback in the 
> tuning model that takes an insn as input an returns a vector of pairs (index, 
> slots) that can be used by dispatch_window to check and make changes to the 
> slot arrays. This callback is only called once when an insn is added.
> - The IS_CMP interface with the Haifa scheduler now returns always false 
> instead of checking if the insn is a compare insn.
>


Hmm I think that would work, I guess it effectively means no special scheduling 
of these instructions.

> >> +
> >> +/* Return TRUE iff INSN fits into the dispatch window according to
> >> +   to all constraints.  */
> >> +bool
> >> +dispatch_window::fits_dispatch_window (rtx_insn *insn) const
> >> +{
> >> +  if (INSN_CODE (insn) >= 0)
> >> +    {
> >> +      if (dump_file)
> >> +     {
> >> +       fprintf (dump_file, "Checking if insn fits into dispatch 
> >> window:\n");
> >> +       print_rtl_single (dump_file, insn);
> >> +     }
> >> +      for (int i = 0; i < m_num_constraints; i++)
> >> +     {
> >> +       if (!m_constraints[i].can_fit_insn (insn))
> >> +         return false;
> >> +     }
> >
> > This and the other usages of the class makes me wonder if this 
> > representation is
> > the right one, because it seems like for every scheduling of instruction, 
> > we walk
> > the number of constraints at least twice.  i.e. in add_insn_to_window which 
> > calls
> > fits_dispatch_window we do a linear scan over each pipeline twice. Even 
> > though
> > num_occupied_slots is essentially classifying the instruction in the exact 
> > constraint
> > already.
> >
> > I think we can avoid this by choosing a different representation.
> > How about instead of representing each constraint as its own class, you 
> > represent the
> > totality of constraints as one entity.
> >
> > That is, you define
> >
> > int constraint_slots[NUM_CONSTRAINTS];
> >
> > and you use as the indexes of the array the enums for the constraints. So 
> > you abstract
> > and iterate over the results of get_attr_neoversev2_dispatch.
> >
> > A reset would then just be a memcpy of the default state and on updating of 
> > a slot
> > anytime you become negative you've got a violation.
> >
> > So you then only have to visit the set of slots the instruction actually 
> > uses instead of all
> > slots.  That should also simplify the creation of the constraints and the 
> > classification of
> > how many slots they use, since we don't need to know.   I think you can 
> > then do away
> > with the dispatch_constrains class entirely as dispatch_window can do all 
> > the work
> > generically with just a pointer to the state class to update and check.
> Do the changes in the updated version cover what you meant here?

Yup! That's pretty nice! I'll leave some inline comments..

> >> +
> >> +/* Return TRUE if INSN is a comparison instruction.  */
> >> +bool
> >> +is_cmp_insn (rtx_insn *insn)
> >> +{
> >> +  enum attr_type type;
> >> +  type = get_attr_type (insn);
> >> +  return (type == TYPE_FCMPS
> >> +       || type == TYPE_FCMPD
> >> +       || type == TYPE_FCCMPS
> >> +       || type == TYPE_FCCMPD
> >> +       || GET_CODE (PATTERN (insn)) == COMPARE);
> >> +}
> >
> > My understanding as for the reason for the scheduler to ask for compares
> > is that it wants to know about flag setting instructions compared to their 
> > consumers.
> >
> > So should is_cmp_insn here also include SVE flag setting instructions?
> > i.e. ORRS, integer compares, MATCH etc.
> >
> > I see on x86 (the only other target that seems to have this implemented)
> > That they delay scheduling of compare related instructions in their dispatch
> > Windows by always returning false, such that they are flushed last.  I think
> > this is with the idea that normally branches are not part of the window as 
> > they
> > are required to be last and delaying the compares you always keep them next
> > to their consumers.
> If I understand the x86 code correctly, they return always false for compare 
> instructions (disp_cmp) in their fits_dispatch_window function, i.e. compare 
> instructions never fit into the window. In the IS_CMP query from the Haifa 
> scheduler they return true for compare instructions and else false.
> That results in compare instructions to be delayed.
> In the aarch64 dispatch scheduler, the fits_window function does not do any 
> such check for compare instructions.
> >
> > That won't entirely work for us as we have many instructions consuming the
> > Flags that are not branches (i.e CSEL).
> >
> > So I wonder if we possibly are not breaking instruction fusion atm.  I 
> > guess one
> > way around it would be to delay both the flag setting and flag consuming 
> > instructions
> > by not scheduling them?
> >
> > But this becomes a problem with say vector compares which have a limited 
> > throughput
> > and would ideally benefit from dispatch based scheduling.
> >
> > Not sure what the best solution here is, so open to suggestions!
> In the updated version, I opted to return always false for the IS_CMP query, 
> because I agree that we have multiple flag-consuming instructions that should 
> not be delayed by moving the compare instructions towards the end of a block.
> What do you think?

I'm still trying to work out what would happen if you have a sequence like

V1 BEXT
V1 BEXT
V0 CMPEQ
B csel

So still writing some RTL examples to get a better understanding of how this 
works out.  In the meantime however...

In patch 2:

+
> +/* Constructor for class dispatch_window.  */
> +dispatch_window::dispatch_window (const int *max_slots, int num_constraints,
> +                               vec<std::pair<int, int>> (*callback) 
> (rtx_insn *))
> +  : m_max_slots (max_slots), m_num_constraints (num_constraints),
> +    m_callback (callback), m_violation (false)
> +{
> +  m_free_slots = XNEWVEC (int, num_constraints);
> +  reset_constraints ();
> +}
> +

Why not pass dispatch_constraint_info directly by reference here?

> +  for (unsigned i = 0; i < constraints.length (); i++)
> +    {
> +      const auto &constraint = constraints[i];

More of a stylistic issue, why not use a for-each iterator

  for (const auto &constraint : constraints)

Same with the other places we iterate.

> +  for (unsigned i = 0; i < constraints.length (); i++)
> +    {
> +      const auto &constraint = constraints[i];
> +      if (constraint.second > 0)
> +     {

Can this ever not be true? if the instruction doesn't use the slot
it wouldn't be in the constraints list with this version no?

In patch 3:

> +static vec<std::pair<int, int>>
> +neoversev2_dispatch_constraint_callback (rtx_insn *insn)
> +{
> +  auto dispatch_group = get_attr_neoversev2_dispatch (insn);
> +  vec<std::pair<int, int>> constraints = vNULL;
> +
> +  /* total slots  */
> +  constraints.safe_push (std::make_pair (0, 1));
> +
> +  switch (dispatch_group)
> +    {
> +    case NEOVERSEV2_DISPATCH_NONE:
> +      break;
> +
> +    case NEOVERSEV2_DISPATCH_BS01:
> +      constraints.safe_push (std::make_pair (1, 1));  /* b, s0, s1 
> pipelines.  */
> +      constraints.safe_push (std::make_pair (4, 1));  /* b, s, m pipelines.  
> */
> +      break;
> +

You can use brace initialization here, e.g.


    case NEOVERSEV2_DISPATCH_BS01:
      constraints.safe_push ({1, 1});  /* b, s0, s1 pipelines.  */
      constraints.safe_push ({4, 1});  /* b, s, m pipelines.  */
      break;

However the vast majority of these are 1, except for 2 entries

> +
> +    case NEOVERSEV2_DISPATCH_V_V13:
> +      constraints.safe_push (std::make_pair (7, 2));  /* v pipelines (2 
> slots).  */
> +      break;
> +

like this one.  How about dropping the std::pair and instead just adding the 
element twice to
the constaints list.

    case NEOVERSEV2_DISPATCH_V_V13:
      constraints.safe_push (7);  /* v pipelines.  */
      constraints.safe_push (7);  /* v pipelines.  */
      break;

That would simplify the code in patch 2 slightly too.

I think we should also use an enum here (just use implicit conversion to cast 
the enum back to int)
such that the numbers mean something.

    case NEOVERSEV2_DISPATCH_V_V13:
      constraints.safe_push (V_PIPE);  /* v pipelines.  */
      constraints.safe_push (V_PIPE);  /* v pipelines.  */
      break;

Aside from these I think we're mostly there.

Thanks,
Tamar

Re: [PATCH 2/3] AArch64: Implement target hooks for dispatch scheduling.

Reply via email to