[Bug tree-optimization/119187] vectorizer should be able to SLP already vectorized code

rguenth at gcc dot gnu.org via Gcc-bugs Tue, 18 Mar 2025 02:29:45 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=119187


--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #5)
> (In reply to Richard Biener from comment #4)
> > 
> >  for (...)
> >    a[32*i] = ..;
> >    a[32*i+1] = ..;
> > ...
> >    a[32*i + 31] = ...;
> > 
> > to match the number of lanes in a HW vector.  It shares some of the same
> > issues as handling vector "scalar" types.
> > 
> > One issue with vector "scalar" types is that there's no scalar defs for
> > the actual lanes - that's somewhat of a representational issue, but we
> > do use those for scheduling for example.  Having the "anchor" explicitly
> > represented would be a first step to solve this.
> 
> I've now gotten far enough to realize what the hints here were trying to say.
> And indeed without solving this representational issue we'd have no way to
> build a correct SLP tree.  And because the number of defs, VF and lanes don't
> match at the moment I get the high unroll factor.
> 
> Representationally I guess the question is whether we want to track the
> vector "scalar" lanes using the same fields we currently use to track normal
> scalar lanes.
> 
> I'm leaning towards a no, and that we'd want to have a vector_stmts field
> next to scalar_stmts and have a vec_lanes and scalar_lanes and the current
> lanes values would be vec_lanes + scalar_lanes?

I think we do want to track the definitions in scalar code that compose
the lanes of an SLP node, if only to tell whether we can easily
split a node, for example.  Consider having two V4SI defs for a node: we
can easily split at the V4SI boundary, but other ops would require code
generation.
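
As a rough illustration of the "easy split" test, here is a minimal
standalone sketch.  It is not GCC code: def_extent and
split_on_def_boundary are hypothetical names, and each def is reduced to
the number of lanes it provides.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for one definition feeding an SLP node.  'lanes' is
// how many node lanes that single def provides (1 for a scalar def, 4 for a
// V4SI def).
struct def_extent
{
  unsigned lanes;
};

// Return true if splitting the node after 'split' lanes falls exactly
// between two defs (or at either end), so that no def has to be broken up
// by generated code.
bool
split_on_def_boundary (const std::vector<def_extent> &defs, unsigned split)
{
  unsigned pos = 0;
  for (const def_extent &d : defs)
    {
      assert (d.lanes >= 1);
      if (pos == split)
        return true;
      pos += d.lanes;
    }
  return pos == split;
}
```

With two V4SI defs, a split after lane 4 is on a boundary, while a split
after lane 2 or 6 would cut through a vector def.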

I have mixed feelings about having vec_lanes and scalar_lanes.  Even if
it's going to be a bit awkward to work with, I'd prefer a common
lane_defs that could for example be { _1[SI], _2[V2SI], _3[SI] }.  We'd
then use SLP_TREE_LANES for the number of lanes (IIRC we still have
SLP_TREE_SCALAR_STMTS.length () in a few places), and iteration
over lane_defs would need to cope with vector defs.  The tricky bit
will be how to interpret NULL lanes, which are now allowed and represent
group gaps for example.  I'd say a NULL lane def would map to a single
scalar lane.
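
A minimal standalone model of such a mixed lane_defs list, assuming each
entry carries the element count of its type.  The names lane_def_entry
and count_lanes are hypothetical and only illustrate why the lane count
can differ from the number of entries.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of one lane_defs entry.  'name' stands in for the SSA
// def (nullptr models a NULL entry such as a group gap) and 'lanes' is the
// number of elements of its type: 1 for SI, 2 for V2SI, and so on.
struct lane_def_entry
{
  const char *name;
  unsigned lanes;
};

// Count the lanes covered by a lane_defs list, treating a NULL entry as a
// single scalar lane.
unsigned
count_lanes (const std::vector<lane_def_entry> &defs)
{
  unsigned n = 0;
  for (const lane_def_entry &d : defs)
    {
      if (d.name == nullptr)
        n += 1;
      else
        {
          assert (d.lanes >= 1);
          n += d.lanes;
        }
    }
  return n;
}
```

For { _1[SI], _2[V2SI], _3[SI] } this gives 4 lanes from 3 entries, which
is why code relying on the length of the def list would need to change.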

What this would not allow is encoding a permuted vector lane def in the
permuted vector's lane defs (for scalars we just permute them into the
correct place).  This shouldn't necessarily be a problem: we simply do
not have an SSA def in the scalar IL for the respective lanes.  We can
simply put NULL there again, though I've almost used NULL as don't-care
here (maybe I did, I'd have to check), so we should think of an encoding
that allows for both 'no scalar def' and 'we don't care about the lane's
value' (for the SLP node this does not necessarily have to be in the
scalar-defs representation, but I tried to use don't-care in the bst_map,
which is only indexed by scalar defs).
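
One possible shape for such an encoding, purely as a sketch: a tagged
lane entry instead of overloading NULL.  Everything here (lane_kind,
lane_entry, lanes_match and the matching rule itself) is an assumption
made for illustration, not existing GCC code or a settled design.

```cpp
#include <cassert>
#include <string>

// Hypothetical tag for one lane entry.
enum class lane_kind
{
  ssa_def,        // a def exists in the scalar IL for this lane
  no_scalar_def,  // lane is live but has no SSA def (permuted vector lane)
  dont_care       // the lane's value is irrelevant
};

struct lane_entry
{
  lane_kind kind;
  std::string def;  // only meaningful when kind == lane_kind::ssa_def
};

// Assumed matching rule: a don't-care lane matches anything, two SSA defs
// match when they name the same def, and a lane without a scalar def never
// matches by identity because there is nothing to compare.
bool
lanes_match (const lane_entry &a, const lane_entry &b)
{
  assert (a.kind != lane_kind::ssa_def || !a.def.empty ());
  assert (b.kind != lane_kind::ssa_def || !b.def.empty ());
  if (a.kind == lane_kind::dont_care || b.kind == lane_kind::dont_care)
    return true;
  if (a.kind == lane_kind::ssa_def && b.kind == lane_kind::ssa_def)
    return a.def == b.def;
  return false;
}
```

Keeping the two NULL meanings as separate tags means a lookup keyed on
scalar defs can tell a lane it may ignore from a lane it cannot identify.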

Note there's currently a hack in place to handle vector defs as externs.
A good proof of design is to make that work without the hack but with
proper lane defs (but still vect_external_def of course).

> That would make the changes in vect_build_slp_instance easier to follow but
> would make e.g. vect_update_slp_vf_for_node automatically do the right thing.
> 
> It would also make it possible to allow combinations of scalar and vector
> code so that
> 
>  for (...)
>    a[32*i] = ..;
>    vec<a>[32*i+5] = ..;
> 
> be possible?

Yes, I think it's important we try to handle this from a design point of view
(not necessarily initially).  Likewise different vector mode defs.

> > It's one of the TODOs that look easy but are not.  Related is to support a
> > fractional VF so we can re-roll
> 
> I guess the fractionality would only be in the subcomponents right? i.e. in
> the above the scalar variants would have a fractional VF. But the overall VF
> should still be as is today?
> 
> Just processing what your thinking around this is.

A fractional VF would be a transform possibility.  The scalar code represents
VF == 1.  If all SLP instances have N_i lanes and all N_i are a multiple of
VF_x, then the minimal vectorization factor would be 1/VF_x and we'd
re-roll the loop body as part of the transform instead of unrolling it.
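
The arithmetic can be sketched as follows.  Taking VF_x to be the
largest value dividing every N_i (their gcd) is an assumption of this
sketch, and gcd_u and reroll_factor are hypothetical names rather than
GCC functions.

```cpp
#include <cassert>
#include <vector>

// Greatest common divisor by Euclid's algorithm; gcd_u (0, n) == n.
unsigned
gcd_u (unsigned a, unsigned b)
{
  while (b != 0)
    {
      unsigned t = a % b;
      a = b;
      b = t;
    }
  return a;
}

// Given the lane counts N_i of all SLP instances, return the largest VF_x
// that divides every N_i.  The minimal vectorization factor would then be
// 1/VF_x, i.e. the loop body could be re-rolled VF_x times.
unsigned
reroll_factor (const std::vector<unsigned> &lanes)
{
  assert (!lanes.empty ());
  unsigned g = 0;
  for (unsigned n : lanes)
    g = gcd_u (g, n);
  return g;
}
```

For the 32-store example above, a single instance with N = 32 gives
VF_x = 32 and a minimal VF of 1/32; instances with 32 and 16 lanes would
give VF_x = 16.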
