https://gcc.gnu.org/bugzilla/show_bug.cgi?id=106346

--- Comment #5 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Tamar Christina from comment #4)
> I believe the problem is actually g:27842e2a1eb26a7eae80b8efd98fb8c8bd74a68e
> 
> We added an optab for the widening left shift pattern there however the
> operation requires a uniform shift constant to work. See
> https://godbolt.org/z/4hqKc69Ke
> 
> The existing pattern that deals with this is vect_recog_widen_shift_pattern
> which is a scalar pattern.  During build_slp it validates that the constants
> are the same, and when they're not it aborts SLP.  This is why we lose
> vectorization.  Eventually we hit V4HI, for which we have no widening shift
> optab, and it vectorizes using that low VF.
> 
> This example shows a number of things wrong:
> 
> 1. The generic costing seems off: this sequence shouldn't have been
> generated, since as a vector sequence it's less efficient than the scalar
> sequence. Using -mcpu=neoverse-n1 or any other costing structure correctly
> gives only scalar code.
> 
> 2. vect_recog_widen_shift_pattern is implemented in the wrong place.  It
> predates the existence of the SLP pattern matcher. Because of the uniform
> requirements it's better to use the SLP pattern matcher where we have access
> to all the constants to decide whether the pattern is a match or not.  That
> way we don't abort SLP. Are you ok with this as a fix Richi?
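To make the "uniform shift constant" requirement concrete, here is a minimal C sketch. It is a hypothetical illustration, not the testcase behind the godbolt links, and the function names are invented. The first function widens and shifts every lane by the same constant, which is the shape a widening-shift instruction with a single immediate can cover; the second uses a different constant per lane, which is the shape that presumably makes SLP give up once the scalar pattern has been applied.

```c
#include <stdint.h>

/* Hypothetical example: every lane is widened from 16 to 32 bits and
   shifted left by the same constant (4).  */
void widen_shift_uniform(uint32_t dst[4], const uint16_t src[4])
{
    dst[0] = (uint32_t)src[0] << 4;
    dst[1] = (uint32_t)src[1] << 4;
    dst[2] = (uint32_t)src[2] << 4;
    dst[3] = (uint32_t)src[3] << 4;
}

/* Hypothetical example: same widening, but each lane uses a different
   shift constant, so no single immediate describes the whole group.  */
void widen_shift_mixed(uint32_t dst[4], const uint16_t src[4])
{
    dst[0] = (uint32_t)src[0] << 1;
    dst[1] = (uint32_t)src[1] << 2;
    dst[2] = (uint32_t)src[2] << 3;
    dst[3] = (uint32_t)src[3] << 4;
}
```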

Patterns are difficult beasts - I think vect_recog_widen_shift_pattern is
in the correct place; what is lacking instead is SLP discovery support
for scrapping it - that is, ideally the vectorizer would take patterns as
a hint and ignore them when they are not helpful.

Now - in theory, for SLP vectorization, all patterns could be handled
as SLP patterns and scalar patterns disabled.  But that isn't easy to
do either.

I fear that, to fight this regression, the easiest route is to pretend the
ISA can do a widening shift by vector and fix it up in the expander ...
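As a rough model of what such an expander fixup could compute, the C sketch below splits a widening shift with per-lane amounts into two steps: zero-extend each lane, then do an ordinary per-lane shift on the wide elements. This only illustrates the semantics. The function name, the lane count and the array-based "vector" are assumptions made for the example, not GCC internals or the actual aarch64 expander code.

```c
#include <stdint.h>

#define LANES 4

/* Hypothetical model of a "widening shift by vector" fixup.
   Step 1 widens each 16-bit lane to 32 bits (zero extension).
   Step 2 shifts each widened lane by its own amount.
   Every amount is assumed to be less than 32.  */
void widen_shift_by_vector_fixup(uint32_t dst[LANES],
                                 const uint16_t src[LANES],
                                 const uint8_t amounts[LANES])
{
    uint32_t widened[LANES];

    for (int i = 0; i < LANES; i++)
        widened[i] = src[i];                  /* zero-extend the lane */

    for (int i = 0; i < LANES; i++)
        dst[i] = widened[i] << amounts[i];    /* per-lane shift */
}
```

When all entries of `amounts` are equal, this degenerates to the uniform case that the existing widening-shift optab already handles.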

> 3. The epilogue costing seems off.
> 
> This example https://godbolt.org/z/YoPcWv6Td ends up generating an
> exceptionally high epilogue cost and so thinks vectorization at the higher
> VF is not profitable.
> 
> *src1_18(D) 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 2B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 4B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 6B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 8B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 10B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 12B] 1 times vec_to_scalar costs 2 in epilogue
> MEM[(uint16_t *)src1_18(D) + 14B] 1 times vec_to_scalar costs 2 in epilogue
> /app/example.c:16:12: note: Cost model analysis for part in loop 0:
>   Vector cost: 23
>   Scalar cost: 17

I don't see any epilogue cost - the example doesn't have a loop.  With BB
vect you could see no epilogue costs?

> For some reason it thinks it needs a scalar epilogue?  Using
> -fno-vect-cost-model gives the desired codegen.
