https://gcc.gnu.org/bugzilla/show_bug.cgi?id=103393

--- Comment #4 from John S <jschoen4 at gmail dot com> ---
I can confirm from my side that it does appear to be the memmove inline
expansion and not the auto-vectorizer.  It also occurs with
__builtin_memset/__builtin_memcpy.
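
For reference, a minimal sketch of the kind of code that should exercise the
inline expansion.  The struct, its 64-byte size, and the function names are
made up for illustration and are not taken from the original test case; the
size at which gcc switches to wider moves is an assumption and may differ
between versions and tunings.

```c
#include <string.h>

/* Hypothetical fixed-size record.  The size is known at compile time, so
 * gcc can expand the calls below inline instead of calling into libc. */
struct record {
    unsigned char bytes[64];
};

/* Candidate for memmove inline expansion. */
void copy_record(struct record *dst, const struct record *src)
{
    memmove(dst, src, sizeof *dst);
}

/* The same pattern through the builtins mentioned above. */
void dup_record(struct record *restrict dst, const struct record *restrict src)
{
    __builtin_memcpy(dst, src, sizeof *dst);
}

void clear_record(struct record *r)
{
    __builtin_memset(r, 0, sizeof *r);
}
```

Compiling this with something like "gcc -O2 -mavx -mprefer-vector-width=128
-mno-vzeroupper -S" and looking for %ymm registers in the output should show
whether the expansion respects the preferred width.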

For some context, this is an issue that would prevent the usage of gcc in my
production environment.  It will certainly impact other use cases outside of my
own as well.  For example, it becomes impossible to use "-mno-vzeroupper -mavx
-mprefer-vector-width=128" and use _mm256_xxx + _mm256_zeroupper()
intrinsics to properly manage the ymm state (clear or not) since the compiler
is now able to insert ymm's almost anywhere via the memmove inlining.
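
To illustrate the pattern I mean, here is a rough sketch of manually managed
ymm state.  The function name and the reduction itself are hypothetical; the
target attribute is only there so the snippet builds without -mavx on the
command line.  The scenario above assumes a build with -mno-vzeroupper, where
the explicit _mm256_zeroupper() is expected to be the only place the upper
halves get cleared.

```c
#include <immintrin.h>

/* Hypothetical helper: sums 8 floats using one 256-bit load, then clears
 * the upper ymm halves by hand before returning to SSE/128-bit code. */
__attribute__((target("avx")))
float sum8_avx(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);            /* explicit ymm use */
    __m128 lo = _mm256_castps256_ps128(v);     /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* lanes 4..7 */
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);                     /* every lane holds the total */

    _mm256_zeroupper();                        /* manual ymm state management */
    return _mm_cvtss_f32(s);
}
```

This only works if the compiler does not introduce ymm uses of its own after
the _mm256_zeroupper() call, which is exactly what the memmove inlining
breaks.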

Up until now, -mprefer-vector-width has always behaved in such a way that
auto-generated vector uses never exceed the preferred width.  Only explicit use
of the _mm256_/_mm512_ intrinsics or the "vector types", i.e. `__m256 var;
__m512 var;`, would result in wider register usage.

I do believe Clang/icc behave this way as well, and there are dependencies on
this behavior.  The same also applies with AVX-512 enabled, to ZMM usage with a
preferred width of 128/256, where the downclocking issues can be even more
pronounced.
