https://gcc.gnu.org/bugzilla/show_bug.cgi?id=101097
--- Comment #4 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #2)
> (In reply to Richard Biener from comment #1)
> > Hmm, so the difference is that we use loop vect for 'foo' but fail to do
> > that for 'bar' and BB vect succeeds. Disabling loop vect but enabling BB
> > vect also produces optimal code for 'foo' (unrolling happens before):
> >
> > foo:
> > .LFB0:
> >         .cfi_startproc
> >         vpmovzxwd       (%rsi), %ymm0
> >         vpmovzxwd       (%rdi), %ymm1
> >         vpaddd  %ymm1, %ymm0, %ymm0
> >         vmovdqu %ymm0, (%rdx)
> >         vzeroupper
> >
> > the key difference in the vectorizer is that BB vect supports different
> > vector sizes in the same instance but the loop vectorizer can only use
> > a single vector size.
> Is there any plan for extending loop vectorizer to handle different vector
> sizes?

It's not an easy task. We commit to vector types per statement, and quite early (vect_determine_vectorization_factor). The same is in principle true for BB vect, but there we know the vectorization factor beforehand (it's 1; we can't unroll a BB) and thus try to tweak the vector size instead of failing.

What would need to be done is to determine the output vector type in vectorizable_conversion based on the input vector types. But then that would need to be another phase of vectorizable_* calls, since the final vectorization factor would not yet be set.

The whole thing is related to vector size iteration, where the idea would be to somehow compute for each stmt a set of input & output vector types that the target supports and then somehow select the sets that we want to send to costing.

As said, it's a lot of work, something that might be easier once we get rid of the SLP vs. non-SLP duality.