https://gcc.gnu.org/bugzilla/show_bug.cgi?id=118057
--- Comment #6 from Robin Dapp <rdapp at gcc dot gnu.org> ---
(In reply to Richard Biener from comment #5)
> I would expect this to be always slower when vectorized unless the core is
> seriously bottle-necked on the frontend. The loads/stores need to be
> decomposed to separate uops, there's no actual vector operation. The vector
> op introduces an artificial dependence between otherwise independent lanes
> which could execute OOO in scalar.
>
> I think GCC behaves better here.

In the end a strided vector implementation will have to perform something
similar internally to what element-wise scalar accesses would do, yeah. So not
exactly a poster child for vectorization.

But there might be uarchs (not sure about ours, but possible) that can do more
strided vector element loads per cycle than scalar loads without being severely
frontend bottle-necked. And those vector ops could still execute OOO, just as
larger chunks (or possibly even the uops)?
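
For readers without the testcase at hand, a minimal sketch of the kind of loop
being discussed might look like the following. This is an assumption made for
illustration, not the code from the PR: the function name `strided_copy` and
its parameters are hypothetical. The point is only that each iteration is one
strided load and one strided store with no arithmetic in between, so a
vectorized version (for example using strided vector memory instructions such
as RVV's `vlse32.v`/`vsse32.v`) has no actual vector operation to amortize the
per-element memory accesses.

```c
#include <stddef.h>

/* Hypothetical example, not the PR's testcase: copy n elements that are
   `stride` ints apart.  Each iteration is one load and one store with no
   computation, and the iterations are independent of each other, so a
   scalar core can execute them out of order.  */
void
strided_copy (int *restrict dst, const int *restrict src,
              size_t n, size_t stride)
{
  for (size_t i = 0; i < n; i++)
    dst[i * stride] = src[i * stride];
}
```

Whether the vectorized or the scalar form of such a loop is faster would
depend on how many strided element accesses per cycle the target can sustain,
which is the question raised above.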