https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98138

--- Comment #6 from Richard Biener <rguenth at gcc dot gnu.org> ---
Starting from the loads is not how SLP discovery works so there will be
zero re-use of code.  Sure - the only important thing is you end up
with a valid SLP graph.

But going back to the original testcase and the proposed vectorization
for power - is that faster in the end?

For the "rewrite" of the vectorizer into all-SLP we do have to address
that "interleaving scheme not carried out as interleaving" at some point,
but that's usually for loop vectorization - for BB vectorization all
we have is optimize_slp.  I have patches that would build the vector load
SLP node (you still have to kill that 'build from scalars' thing to make
it trigger).  But then we end up with a shared vector load node and
N extract/splat operations at the 'scalar' points.  It's not entirely
clear to me how to re-arrange the SLP graph at that point.
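
To make the "shared vector load plus N extract/splat operations" shape
concrete, here is a minimal sketch in C.  It is a hypothetical kernel,
not the testcase from this PR, and the name scale_groups is made up for
illustration.  The two adjacent loads k[0] and k[1] could be covered by
a single vector load, but each value is consumed as a uniform (scalar)
operand of a four-lane group, so each use would need one lane extracted
and splatted.

```c
/* Hypothetical kernel, not the testcase from this PR.
   k[0] and k[1] are adjacent in memory, so one vector load could
   fetch both; each is then used as a uniform operand of a 4-lane
   group, i.e. at a 'scalar' point of the SLP graph.  */
void
scale_groups (const float *k, const float *in, float *out)
{
  /* Group 0: all four lanes multiply by k[0] (splat of lane 0).  */
  out[0] = in[0] * k[0];
  out[1] = in[1] * k[0];
  out[2] = in[2] * k[0];
  out[3] = in[3] * k[0];

  /* Group 1: all four lanes multiply by k[1] (splat of lane 1).  */
  out[4] = in[4] * k[1];
  out[5] = in[5] * k[1];
  out[6] = in[6] * k[1];
  out[7] = in[7] * k[1];
}
```

Whether a given GCC revision actually builds the SLP graph this way for
such code is an assumption; the sketch only shows the kind of source
that has this structure.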

Btw, on current trunk the simplified testcase no longer runs into the
'scalar operand' build case, but of course vectorization is still deemed
not profitable.  Pattern recognition of the plus/minus subgraphs may help
(not sure if ppc has those as an instruction; x86 does).
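
As an illustration of what a plus/minus subgraph looks like at the
source level, here is a small sketch in C.  It is a hypothetical
example (the name addsub4 is made up), not code from this PR.  The
lanes alternate between subtract and add, which is the lane layout of
the x86 SSE3 ADDSUBPS instruction (subtract in even lanes, add in odd
lanes).

```c
/* Hypothetical example, not the testcase from this PR.
   Even lanes subtract, odd lanes add: the same lane layout as
   x86 ADDSUBPS.  */
void
addsub4 (const float *a, const float *b, float *out)
{
  out[0] = a[0] - b[0];
  out[1] = a[1] + b[1];
  out[2] = a[2] - b[2];
  out[3] = a[3] + b[3];
}
```

Without a matching instruction the mixed group has to be handled some
other way, for example by computing both a vector add and a vector
subtract and blending the lanes, which adds to the cost.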

That said, "failure" to identify the common (vector) load is a known
issue, and I do have experimental patches trying to address it but have
not yet arrived at a conclusive "best" approach.
