> -----Original Message-----
> From: Richard Biener <rguent...@suse.de>
> Sent: Tuesday, May 13, 2025 12:08 PM
> To: Richard Sandiford <richard.sandif...@arm.com>
> Cc: gcc-patches@gcc.gnu.org; Tamar Christina <tamar.christ...@arm.com>
> Subject: Re: [PATCH][RFC] Add vector_costs::add_vector_cost vector stmt
> grouping hook
> 
> On Tue, 13 May 2025, Richard Sandiford wrote:
> 
> > Richard Biener <rguent...@suse.de> writes:
> > > The following refactors the vectorizer vector_costs target API
> > > to add a new vector_costs::add_vector_cost entry which groups
> > > all individual sub-stmts we create per "vector stmt", aka SLP
> > > node.  This allows for the targets to more easily match on
> > > complex cases like emulated gather/scatter or even just vector
> > > construction.
> > >
> > > The patch itself is just a prototype and leaves out BB vectorization
> > > for simplicity.  It also does not fully group all vector stmts
> > > but leaves some bare add_stmt_cost hook invocations.  I'd expect
> > > the add_stmt_cost hook to be still used for scalar stmt costing and
> > > for costing added branching around prologue/epilogue.  The
> > > default implementation of add_vector_cost just dispatches to
> > > add_stmt_cost for individual stmts.  Eventually the actual data
> > > we track for the combined costing will diverge (no need to track
> > > SLP node or stmt_info there?), so targets would eventually be
> > > expected to implement both hooks and splice out common workers
> > > to deal with "missing" information coming in from the different
> > > entries.
> > >
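As a rough illustration of the default dispatch described above, here is a minimal C++ sketch. All types and unit costs are hypothetical stand-ins (`vector_costs_sketch`, `stmt_cost_entry`, `cost_kind`, `gather_aware_costs`), not GCC's real `vector_costs` or `stmt_info_for_cost`. It only shows the shape: the grouped hook forwards each entry to the per-stmt hook by default, while a target override can match on the whole group, for example an emulated gather.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified stand-ins for the vectorizer's cost entries.
enum class cost_kind { scalar_load, vec_construct, vector_stmt };

struct stmt_cost_entry {
  unsigned count;   // how many copies of this sub-stmt are costed
  cost_kind kind;   // what kind of sub-stmt it is
};

class vector_costs_sketch {
public:
  virtual ~vector_costs_sketch() = default;

  // Per-stmt hook: accumulate count * (made-up) unit cost.
  virtual unsigned add_stmt_cost(unsigned count, cost_kind kind) {
    unsigned cost = count * unit_cost(kind);
    m_total += cost;
    return cost;
  }

  // Grouped hook: by default just dispatch every entry of the group
  // (one "vector stmt", aka SLP node) to add_stmt_cost.
  virtual unsigned add_vector_cost(const std::vector<stmt_cost_entry> &group) {
    unsigned sum = 0;
    for (const stmt_cost_entry &entry : group)
      sum += add_stmt_cost(entry.count, entry.kind);
    return sum;
  }

  unsigned total() const { return m_total; }

protected:
  // Made-up unit costs, purely for illustration.
  static unsigned unit_cost(cost_kind kind) {
    switch (kind) {
      case cost_kind::scalar_load:   return 1;
      case cost_kind::vec_construct: return 4;
      case cost_kind::vector_stmt:   return 1;
    }
    return 1;
  }

  unsigned m_total = 0;
};

// A target that recognizes "N scalar loads + one vector construction" as an
// emulated gather and costs the group as a whole (cost of 3 per lane is an
// arbitrary assumption).
class gather_aware_costs : public vector_costs_sketch {
public:
  unsigned add_vector_cost(const std::vector<stmt_cost_entry> &group) override {
    if (group.size() == 2
        && group[0].kind == cost_kind::scalar_load
        && group[1].kind == cost_kind::vec_construct) {
      unsigned cost = group[0].count * 3;
      m_total += cost;
      return cost;
    }
    // Anything else falls back to the per-stmt default.
    return vector_costs_sketch::add_vector_cost(group);
  }
};
```

With the group available in one call, the override does not need to guess from an isolated scalar load whether it belongs to a gather.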
> > > This should eventually baby-step us towards the generic vectorizer
> > > code being able to compute and compare latency and resource
> > > utilization throughout the scalar / vector loop iteration based
> > > on latency and throughput data determined on a stmt-by-stmt basis
> > > from the target.  As given the grouping should be an incremental
> > > improvement, but I have not tried to see how it can simplify
> > > the x86 hook implementation - I've been triggered by the aarch64
> > > reported bootstrap fail on the cleanup RFC I posted given that
> > > code wants to identify a scalar load that's costed as part of
> > > a gather/scatter operation.
> > >
> > > Any comments or problems you foresee?
> >
> > Could the stmt_vector_for_cost pointer instead be passed to
> > TARGET_VECTORIZE_CREATE_COSTS?  The danger with passing it to
> > add_vector_cost is that the same vector_costs instance might get used
> > for multiple different costing attempts, so that only the provided
> > stmt_vector_for_costs are specific to the current costing attempt.
> > But for complex cases, the target's vector_costs should be able
> > to cache its own target-specific information, with the same
> > lifetime/scope as the stmt_vector_for_costs.
> 
> It cannot be passed to TARGET_VECTORIZE_CREATE_COSTS - but I could
> simply not pass it at all; in the proposed implementation it is
> actually node->cost_vec.  It's the set of stmts we cost for
> a single SLP node.  I'm not sure the "group" is what targets
> would cache, they'd rather cache whatever they make from the
> group and its contents?
> 
> That said, the most aggressive way of handling it would be
> to defer everything to the target and just pass in the
> set of SLP instances to TARGET_VECTORIZE_CREATE_COSTS and
> not perform any individual add_stmt_cost calls at all, but expect
> the target to walk the SLP graph at finish_cost () time.
> 

I was actually wondering whether it wouldn't indeed be better to cost
the slp_instances, as those contain roots that would need to be costed
too.

For early break, if we're costing purely based on SLP nodes, then the
actual break itself can't be costed, as it's not in a node.  We'd need
to cost it to be able to re-order the exits during SLP scheduling based
on their actual cost.

Cheers,
Tamar

> The x86 target currently keeps counters of certain ops but
> does not cache the full-blown stmts from add_stmt_cost for
> computing the overall cost at finish_cost.  I'll have to look
> what aarch64 does here.
> 
> Ultimately I'd like to take into account stmt dependences
> during costing - at the moment we are asking the target to
> compute per stmt "latencies" but then we just sum those.
> One improvement would be to compute the max latency through
> the graph and the maximum width (without having throughput
> or port assignments and an actual scheduler implementation).
> 
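To make the difference between summing per-stmt latencies and taking the max latency through the graph concrete, here is a small C++ sketch. It is not vectorizer code: `node_sketch`, `graph_summary` and `summarize` are hypothetical names, the nodes are assumed to be given in topological order, and "maximum width" is interpreted here as the largest number of nodes sharing the same dependence depth, which is only one possible reading.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical node of a dependence graph: a latency plus the indices of
// the nodes it depends on (which must appear earlier in the vector).
struct node_sketch {
  unsigned latency;
  std::vector<std::size_t> operands;
};

struct graph_summary {
  unsigned summed_latency;  // what plain summing of per-stmt costs gives
  unsigned critical_path;   // max latency along any dependence chain
  unsigned max_width;       // most nodes found at one dependence depth
};

graph_summary summarize(const std::vector<node_sketch> &nodes) {
  std::vector<unsigned> finish(nodes.size(), 0);  // earliest finish time
  std::vector<unsigned> depth(nodes.size(), 0);   // dependence depth
  std::vector<unsigned> width;                    // node count per depth
  graph_summary summary{0, 0, 0};

  for (std::size_t i = 0; i < nodes.size(); ++i) {
    unsigned start = 0;
    unsigned level = 0;
    // A node can start only once all of its operands have finished.
    for (std::size_t op : nodes[i].operands) {
      start = std::max(start, finish[op]);
      level = std::max(level, depth[op] + 1);
    }
    finish[i] = start + nodes[i].latency;
    depth[i] = level;

    if (width.size() <= level)
      width.resize(level + 1, 0);
    ++width[level];

    summary.summed_latency += nodes[i].latency;
    summary.critical_path = std::max(summary.critical_path, finish[i]);
    summary.max_width = std::max(summary.max_width, width[level]);
  }
  return summary;
}
```

For two independent loads of latency 4 feeding a multiply of latency 3 and then a store of latency 1, summing gives 12 while the longest dependence chain is 8, since the two loads can overlap.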
> Richard.
> 
> >
> > Thanks,
> > Richard
> >
> 
> --
> Richard Biener <rguent...@suse.de>
> SUSE Software Solutions Germany GmbH,
> Frankenstrasse 146, 90461 Nuernberg, Germany;
> GF: Ivo Totev, Andrew McDonald, Werner Knoblich; (HRB 36809, AG Nuernberg)
