https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138
--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---

Starting from the loads is not how SLP discovery works so there will be zero re-use of code. Sure - the only important thing is you end up with a valid SLP graph.

But going back to the original testcase and the proposed vectorization for power - is that faster in the end?

For the "rewrite" of the vectorizer into all-SLP we do have to address that "interleaving scheme not carried out as interleaving" at some point, but that's usually for loop vectorization - for BB vectorization all we have is optimize_slp.

I have patches that would build the vector load SLP node (you still have to kill that 'build from scalars' thing to make it trigger). But then we end up with a shared vector load node and N extract/splat operations at the 'scalar' points. It's not entirely clear to me how to re-arrange the SLP graph at that point.

Btw, on current trunk the simplified testcase no longer runs into the 'scalar operand' build case but of course vectorization is considered not profitable. Pattern recog of the plus/minus subgraphs may help (not sure if ppc has those as instructions, x86 does).

That said, "failure" to identify the common (vector) load is known and I do have experimental patches trying to address that but did not yet arrive at a conclusive "best" approach.
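To illustrate the shape being discussed, here is a minimal sketch in plain C. It is not the testcase from this PR and not vectorizer output; the function names and the four-lane layout are made up for illustration. The first function is a scalar plus/minus subgraph fed by four adjacent loads. The second writes the same computation lane-wise: one shared load, two lane rearrangements standing in for the extract/splat operations, and an alternating add/sub that is the kind of thing pattern recog could match on targets that have such an instruction.

```c
#include <assert.h>

enum { LANES = 4 };

/* Scalar form: four adjacent loads feeding a plus/minus subgraph.
   Hypothetical kernel, not the PR testcase. */
static void butterfly_scalar(const int *in, int *out)
{
    int a = in[0], b = in[1], c = in[2], d = in[3];
    out[0] = a + b;
    out[1] = a - b;
    out[2] = c + d;
    out[3] = c - d;
}

/* Lane-wise form of the same computation, modelled with small arrays
   instead of real vector types so it builds with any C compiler. */
static void butterfly_lanes(const int *in, int *out)
{
    int v[LANES], even[LANES], odd[LANES];

    /* The single shared "vector load". */
    for (int i = 0; i < LANES; i++)
        v[i] = in[i];

    /* Lane rearrangements standing in for extract/splat:
       even = {a, a, c, c}, odd = {b, b, d, d}. */
    for (int i = 0; i < LANES; i++) {
        even[i] = v[i & ~1];
        odd[i]  = v[i | 1];
    }

    /* Alternating add/sub: add in even lanes, subtract in odd lanes. */
    for (int i = 0; i < LANES; i++)
        out[i] = (i & 1) ? even[i] - odd[i] : even[i] + odd[i];
}

/* Self-check: both forms must agree lane for lane. */
static void check_forms_agree(const int *in)
{
    int s[LANES], l[LANES];
    butterfly_scalar(in, s);
    butterfly_lanes(in, l);
    for (int i = 0; i < LANES; i++)
        assert(s[i] == l[i]);
}
```

Calling check_forms_agree() on any four-element input confirms the two forms compute the same lanes. Whether the lane-wise form is actually profitable on a given target is exactly the open question above.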