SLP vectorization possible inefficiency

rdapp at gcc dot gnu.org via Gcc-bugs Wed, 08 Jan 2025 01:45:29 -0800

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=115340


--- Comment #2 from Robin Dapp <rdapp at gcc dot gnu.org> ---
> The stores are not considered "grouped" because they have gaps.

> To do better we'd have to improve the store dataref analysis to see
> that a vectorization factor of four would "close" the gaps, or more
> generally support store groups with gaps.  Stores with gaps can be
> handled by masking for example.

I have been stepping through the code and experimenting a bit, starting with
the store side.

    for( int i = 0; i < 4; i++ )
    {
        out[i + 0] = tmp[0][i] + 1;

None of those stores is considered grouped because, for each data-ref, the
step is constant and the access itself is contiguous.  We discover four of
those (as you mentioned before).

With some dirty hacks (i.e. continuing the group discovery even for the
contiguous case and annotating the statements/refs with a special flag) it is
possible to discover the full group of four and mark the stores as related.

Then (again as you said) the lack of store-with-gaps support is still in the
way, but for the case here we could just ignore the gap at the early discovery
phase.  We'd just need to make sure the current behavior is preserved for all
other cases.
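
To make the "gaps close" condition concrete, here is a minimal sketch in
plain C.  It is not GCC code: the function name, the parameters and the
MAX_ELTS bound are hypothetical, and it assumes the group consists of n_refs
unit-step stores whose start offsets are gap elements apart (0, 4, 8, 12 for
the loop above, which is itself an assumption since only the first store is
quoted).  It counts how often each element of the block is written in vf
iterations and accepts only if every element is written exactly once.

```c
#include <assert.h>
#include <stdbool.h>

enum { MAX_ELTS = 64 };  /* Arbitrary bound for this sketch.  */

/* Return true if N_REFS unit-step stores, whose start offsets are GAP
   elements apart, write every element of a block of N_REFS * GAP
   elements exactly once within VF iterations.  */
static bool
stores_tile_block_p (unsigned n_refs, unsigned gap, unsigned vf)
{
  unsigned hits[MAX_ELTS] = { 0 };
  unsigned total = n_refs * gap;

  if (total == 0 || total > MAX_ELTS)
    return false;

  for (unsigned i = 0; i < vf; i++)
    for (unsigned r = 0; r < n_refs; r++)
      {
        unsigned idx = i + r * gap;
        if (idx >= total)
          return false;  /* A store runs past the block.  */
        hits[idx]++;
      }

  for (unsigned k = 0; k < total; k++)
    if (hits[k] != 1)
      return false;  /* A hole (0) or an overlap (> 1) remains.  */
  return true;
}

/* Usage example for the case discussed above.  */
static void
example (void)
{
  assert (stores_tile_block_p (4, 4, 4));   /* VF 4: the gaps close.  */
  assert (!stores_tile_block_p (4, 4, 2));  /* VF 2: holes remain.  */
}
```

Under these assumptions the check reduces to vf == gap, which is why
ignoring the gap during early discovery would only be safe for that case.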

Is that a way forward?  I was thinking of adding another memory access type
like VMAT_GAP_CLOSING (or whatever name fits) for such cases.  In the
analysis part we'd need to verify that the vectorization factor matches the
group gap and that a sufficiently large vector type is supported, etc.  If
everything succeeds we could emit one large store instead of the four
individual ones.
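
As a source-level illustration of the intended effect (not of what the
vectorizer would literally emit), the sketch below compares the four gapped
stores with a single 16-element store.  The function names are hypothetical,
and the offsets 4, 8 and 12 as well as the tmp[1..3] rows are assumed, since
only the first store of the loop is quoted above.

```c
#include <assert.h>
#include <string.h>

/* Scalar form: four unit-step stores per iteration, 4 elements apart.
   Only the first statement is taken from the quoted loop.  */
static void
store_gapped (int *restrict out, int tmp[4][4])
{
  for (int i = 0; i < 4; i++)
    {
      out[i + 0] = tmp[0][i] + 1;
      out[i + 4] = tmp[1][i] + 1;   /* Assumed.  */
      out[i + 8] = tmp[2][i] + 1;   /* Assumed.  */
      out[i + 12] = tmp[3][i] + 1;  /* Assumed.  */
    }
}

/* With a vectorization factor of 4 the stores cover out[0..15] with no
   holes, so the same result can be written as one 16-element store.  */
static void
store_wide (int *restrict out, int tmp[4][4])
{
  int wide[16];

  for (int k = 0; k < 16; k++)
    wide[k] = tmp[k / 4][k % 4] + 1;
  memcpy (out, wide, sizeof wide);  /* Stands in for the large store.  */
}

/* Usage example: both forms produce identical output.  */
static void
example (void)
{
  int tmp[4][4], a[16], b[16];

  for (int k = 0; k < 16; k++)
    tmp[k / 4][k % 4] = 3 * k;
  store_gapped (a, tmp);
  store_wide (b, tmp);
  assert (memcmp (a, b, sizeof a) == 0);
}
```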

Or is that too specific?  If we had full store-with-gaps support, let's say
using masking, we'd still need dedicated handling for cases where the gaps
vanish I suppose.

[Bug tree-optimization/115340] Loop/SLP vectorization possible inefficiency
