https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057

--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> I would expect this to be always slower when vectorized unless the core is
> seriously bottle-necked on the frontend.  The loads/stores need to be
> decomposed to separate uops, there's no actual vector operation.  The vector
> op introduces an artificial dependence between otherwise independent lanes
> which could execute OOO in scalar.
> 
> I think GCC behaves better here.

In the end a strided vector implementation will have to perform something
similar internally to what element-wise scalar accesses would do, yeah.  So not
exactly a poster child for vectorization.

But there might be uarchs (not sure about ours, but possible) that can do more
strided vector element loads per cycle than scalar loads without being
severely frontend bottle-necked.  And those vector ops could still execute OOO,
just as larger chunks (or possibly even as individual uops)?
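
For illustration only, here is a minimal sketch of the two shapes being
compared.  It is not the test case from this PR: the function names, the
LANES constant and the stride parameter are all made up.  The second function
only models in plain C what a 4-lane strided load/store pair might decompose
into; how a real core splits such an access is uarch-specific.

```c
#include <stddef.h>

/* Hypothetical lane count, standing in for the vector length. */
enum { LANES = 4 };

/* Element-wise scalar form: every iteration is an independent
   load/store pair that an OOO core can schedule freely. */
void strided_copy(int *restrict dst, const int *restrict src,
                  size_t n, size_t stride)
{
    for (size_t i = 0; i < n; i++)
        dst[i * stride] = src[i * stride];
}

/* Model of the vectorized form: LANES element loads are gathered into
   one "vector" (tmp) before any of the LANES element stores happen, so
   the stores of a chunk depend on all loads of that chunk. */
void strided_copy_chunked(int *restrict dst, const int *restrict src,
                          size_t n, size_t stride)
{
    size_t i = 0;
    for (; i + LANES <= n; i += LANES) {
        int tmp[LANES];
        for (size_t l = 0; l < LANES; l++)
            tmp[l] = src[(i + l) * stride];
        for (size_t l = 0; l < LANES; l++)
            dst[(i + l) * stride] = tmp[l];
    }
    /* Scalar epilogue for the remaining n % LANES elements. */
    for (; i < n; i++)
        dst[i * stride] = src[i * stride];
}
```

Both functions perform the same number of element accesses; the difference
is only in how they are grouped, which is where the per-cycle element
throughput of the strided vector load would have to make up for the added
dependence.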
