https://gcc.gnu.org/bugzilla/show_bug.cgi?id=107093

--- Comment #7 from Richard Biener <rguenth at gcc dot gnu.org> ---
(In reply to Hongtao.liu from comment #5)
> Also i think masked epilog(--param=vect-partial-vector-usage=1) should be
> good for general cases under AVX512, espicially when main loop's vector
> width is 512, and the remain tripcount is not enough for 256-bit
> vectorization but ok for 128-bit vectorization.

Yes, for the fully masked variant I was mostly targeting -O2 with its
very-cheap (size wise) cost model.  Since we don't vectorize the
epilogue of a vectorized epilogue (yet) going fully masked there
should indeed help.  Also when we start to use the unroll hint the
vectorized epilogue might get full width iterations to handle as well.

One downside for a fully masked body is that we're using masked stores
which usually have higher latency due to the "merge" semantics which
means an extra memory input + merge operation.  Not sure if modern
uArchs can optimize the all-ones mask case, the vectorizer, for
.MASK_STORE, still has the code to change those to emit a mask
compare against all-zeros and only conditionally doing a .MASK_STORE.
That could be enhanced to single out the all-ones case, at least for
the .MASK_STOREs in a main fully masked loop when the mask is only
from the iteration (rather than conditional execution).

Reply via email to