The 08/20/2025 16:52, Jennifer Schmitz wrote:
>
>
> > On 5 Aug 2025, at 09:25, Tamar Christina <[email protected]> wrote:
> >
> > External email: Use caution opening links or attachments
> >
> >
> > Hi Jennifer,
> >
> > This part is new to me so I'm just gonna go by the documentation and code
> > for now,
> > so I'll have some questions.
> >
> > I'll also comment on things related to all three patches here.
> Hi Tamar,
> Thanks you for the helpful feedback!
> I revised the patches to improve the representation of the dispatch
> constraints and am attaching the updated version all in this mail (I repeated
> regression-testing and performance evaluation).
>
> The two main changes are:
> - Instead of representing the dispatch constraints as an array of
> dispatch_constraint objects, they are now represented in the following form:
> The number of maximum and free slots are two integer arrays that the
> dispatch_window manages. To avoid traversing all dispatch constraints
> multiple times when adding an insn to a window, I added a callback in the
> tuning model that takes an insn as input an returns a vector of pairs (index,
> slots) that can be used by dispatch_window to check and make changes to the
> slot arrays. This callback is only called once when an insn is added.
> - The IS_CMP interface with the Haifa scheduler now returns always false
> instead of checking if the insn is a compare insn.
>
Hmm I think that would work, I guess it effectively means no special scheduling
of these instructions.
> >> +
> >> +/* Return TRUE iff INSN fits into the dispatch window according to
> >> + to all constraints. */
> >> +bool
> >> +dispatch_window::fits_dispatch_window (rtx_insn *insn) const
> >> +{
> >> + if (INSN_CODE (insn) >= 0)
> >> + {
> >> + if (dump_file)
> >> + {
> >> + fprintf (dump_file, "Checking if insn fits into dispatch
> >> window:\n");
> >> + print_rtl_single (dump_file, insn);
> >> + }
> >> + for (int i = 0; i < m_num_constraints; i++)
> >> + {
> >> + if (!m_constraints[i].can_fit_insn (insn))
> >> + return false;
> >> + }
> >
> > This and the other usages of the class makes me wonder if this
> > representation is
> > the right one, because it seems like for every scheduling of instruction,
> > we walk
> > the number of constraints at least twice. i.e. in add_insn_to_window which
> > calls
> > fits_dispatch_window we do a linear scan over each pipeline twice. Even
> > though
> > num_occupied_slots is essentially classifying the instruction in the exact
> > constraint
> > already.
> >
> > I think we can avoid this by choosing a different representation.
> > How about instead of representing each constraint as its own class, you
> > represent the
> > totality of constraints as one entity.
> >
> > That is, you define
> >
> > int constraint_slots[NUM_CONSTRAINTS];
> >
> > and you use as the indexes of the array the enums for the constraints. So
> > you abstract
> > and iterate over the results of get_attr_neoversev2_dispatch.
> >
> > A reset would then just be a memcpy of the default state and on updating of
> > a slot
> > anytime you become negative you've got a violation.
> >
> > So you then only have to visit the set of slots the instruction actually
> > uses instead of all
> > slots. That should also simplify the creation of the constraints and the
> > classification of
> > how many slots they use, since we don't need to know. I think you can
> > then do away
> > with the dispatch_constrains class entirely as dispatch_window can do all
> > the work
> > generically with just a pointer to the state class to update and check.
> Do the changes in the updated version cover what you meant here?
Yup! That's pretty nice! I'll leave some inline comments..
> >> +
> >> +/* Return TRUE if INSN is a comparison instruction. */
> >> +bool
> >> +is_cmp_insn (rtx_insn *insn)
> >> +{
> >> + enum attr_type type;
> >> + type = get_attr_type (insn);
> >> + return (type == TYPE_FCMPS
> >> + || type == TYPE_FCMPD
> >> + || type == TYPE_FCCMPS
> >> + || type == TYPE_FCCMPD
> >> + || GET_CODE (PATTERN (insn)) == COMPARE);
> >> +}
> >
> > My understanding as for the reason for the scheduler to ask for compares
> > is that it wants to know about flag setting instructions compared to their
> > consumers.
> >
> > So should is_cmp_insn here also include SVE flag setting instructions?
> > i.e. ORRS, integer compares, MATCH etc.
> >
> > I see on x86 (the only other target that seems to have this implemented)
> > That they delay scheduling of compare related instructions in their dispatch
> > Windows by always returning false, such that they are flushed last. I think
> > this is with the idea that normally branches are not part of the window as
> > they
> > are required to be last and delaying the compares you always keep them next
> > to their consumers.
> If I understand the x86 code correctly, they return always false for compare
> instructions (disp_cmp) in their fits_dispatch_window function, i.e. compare
> instructions never fit into the window. In the IS_CMP query from the Haifa
> scheduler they return true for compare instructions and else false.
> That results in compare instructions to be delayed.
> In the aarch64 dispatch scheduler, the fits_window function does not do any
> such check for compare instructions.
> >
> > That won't entirely work for us as we have many instructions consuming the
> > Flags that are not branches (i.e CSEL).
> >
> > So I wonder if we possibly are not breaking instruction fusion atm. I
> > guess one
> > way around it would be to delay both the flag setting and flag consuming
> > instructions
> > by not scheduling them?
> >
> > But this becomes a problem with say vector compares which have a limited
> > throughput
> > and would ideally benefit from dispatch based scheduling.
> >
> > Not sure what the best solution here is, so open to suggestions!
> In the updated version, I opted to return always false for the IS_CMP query,
> because I agree that we have multiple flag-consuming instructions that should
> not be delayed by moving the compare instructions towards the end of a block.
> What do you think?
I'm still trying to work out what would happen if you have a sequence like
V1 BEXT
V1 BEXT
V0 CMPEQ
B csel
So still writing some RTL examples to get a better understanding of how this
works out. In the meantime however...
In patch 2:
+
> +/* Constructor for class dispatch_window. */
> +dispatch_window::dispatch_window (const int *max_slots, int num_constraints,
> + vec<std::pair<int, int>> (*callback)
> (rtx_insn *))
> + : m_max_slots (max_slots), m_num_constraints (num_constraints),
> + m_callback (callback), m_violation (false)
> +{
> + m_free_slots = XNEWVEC (int, num_constraints);
> + reset_constraints ();
> +}
> +
Why not pass dispatch_constraint_info directly by reference here?
> + for (unsigned i = 0; i < constraints.length (); i++)
> + {
> + const auto &constraint = constraints[i];
More of a stylistic issue, why not use a for-each iterator
for (const auto &constraint : constraints)
Same with the other places we iterate.
> + for (unsigned i = 0; i < constraints.length (); i++)
> + {
> + const auto &constraint = constraints[i];
> + if (constraint.second > 0)
> + {
Can this ever not be true? if the instruction doesn't use the slot
it wouldn't be in the constraints list with this version no?
In patch 3:
> +static vec<std::pair<int, int>>
> +neoversev2_dispatch_constraint_callback (rtx_insn *insn)
> +{
> + auto dispatch_group = get_attr_neoversev2_dispatch (insn);
> + vec<std::pair<int, int>> constraints = vNULL;
> +
> + /* total slots */
> + constraints.safe_push (std::make_pair (0, 1));
> +
> + switch (dispatch_group)
> + {
> + case NEOVERSEV2_DISPATCH_NONE:
> + break;
> +
> + case NEOVERSEV2_DISPATCH_BS01:
> + constraints.safe_push (std::make_pair (1, 1)); /* b, s0, s1
> pipelines. */
> + constraints.safe_push (std::make_pair (4, 1)); /* b, s, m pipelines.
> */
> + break;
> +
You can use brace initialization here, e.g.
case NEOVERSEV2_DISPATCH_BS01:
constraints.safe_push ({1, 1}); /* b, s0, s1 pipelines. */
constraints.safe_push ({4, 1}); /* b, s, m pipelines. */
break;
However the vast majority of these are 1, except for 2 entries
> +
> + case NEOVERSEV2_DISPATCH_V_V13:
> + constraints.safe_push (std::make_pair (7, 2)); /* v pipelines (2
> slots). */
> + break;
> +
like this one. How about dropping the std::pair and instead just adding the
element twice to
the constaints list.
case NEOVERSEV2_DISPATCH_V_V13:
constraints.safe_push (7); /* v pipelines. */
constraints.safe_push (7); /* v pipelines. */
break;
That would simplify the code in patch 2 slightly too.
I think we should also use an enum here (just use implicit conversion to cast
the enum back to int)
such that the numbers mean something.
case NEOVERSEV2_DISPATCH_V_V13:
constraints.safe_push (V_PIPE); /* v pipelines. */
constraints.safe_push (V_PIPE); /* v pipelines. */
break;
Aside from these I think we're mostly there.
Thanks,
Tamar