https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050
Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-03-23
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |itsimbal at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |jkoval at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This is because the GCC vectorizer does vectorization based on vector size
rather than vectorization factor, and because we internally represent declare
simd functions as implicit loops and use the vectorizer to vectorize them.
For loops without mixed type sizes that makes no difference, but the loop in
question has 32-bit and 64-bit sizes (int and double).

Also, with mixed type sizes, the smallest element type for a given vector size
determines the vectorization factor, and for wider element types we use 2, 4, 8
etc. separate registers. So, e.g. for the _ZGVd* clone, which is AVX2, we start
with vector size 32, which would mean V4DF and V8SI modes, so in that case we'd
vectorize it as 2x V4DF registers and 1x V8SI. But we give up on that because
the implicit loop doesn't have enough iterations (it has exactly 4, not 8+).
Thus we vectorize with a vector size of 16, with 2x V2DF registers and 1x V4SI.

What ICC does instead is vectorize with a vectorization factor of 4 and
determine the register sizes based on that, so V4DF and V4SI.

This can be beneficial even outside of declare simd, at least for
-O2 -ftree-vectorize vectorization, so we get smaller vectorized loops, if the
smaller vectorization factor fits one supported vector size and larger another
one. Not really sure how well this would play with the -mprefer-vector-width=
stuff and the attempts to avoid 512-bit vectors due to CPUs lowering
frequencies.
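
The register arithmetic in the comment above can be reproduced with a small
model. This is only an illustrative sketch in Python, not GCC or ICC code: the
function names and the ELEMENT_SIZES table are hypothetical, and the
factor-based variant assumes without checking that every resulting register
width is supported by the target.

```python
# Toy model of the two strategies described in the comment.
# All names here are hypothetical; this is not how GCC is implemented.

# Element sizes in bytes for the two types in the loop from the report.
ELEMENT_SIZES = {"SI": 4, "DF": 8}  # SI = 32-bit int, DF = 64-bit double


def plan_by_vector_size(vector_bytes, iterations, element_sizes=ELEMENT_SIZES):
    """Fix the vector size first; the smallest element type sets the VF.

    Returns (vf, {mode: register_count}), or None if the loop is too short.
    """
    vf = vector_bytes // min(element_sizes.values())
    if iterations < vf:
        return None  # not enough iterations for this vector size
    plan = {}
    for suffix, size in element_sizes.items():
        lanes = vector_bytes // size          # lanes per register of this type
        plan[f"V{lanes}{suffix}"] = vf // lanes  # wider types need more registers
    return vf, plan


def plan_by_vectorization_factor(vf, element_sizes=ELEMENT_SIZES):
    """Fix the VF first; each type gets one register holding VF lanes.

    Assumption: every resulting register width is supported by the target.
    """
    return vf, {f"V{vf}{suffix}": 1 for suffix in element_sizes}


def first_fitting_plan(vector_sizes, iterations):
    """Try vector sizes from widest to narrowest until one fits the loop."""
    for vector_bytes in sorted(vector_sizes, reverse=True):
        plan = plan_by_vector_size(vector_bytes, iterations)
        if plan is not None:
            return vector_bytes, plan
    return None
```

With 4 iterations (the simdlen of the AVX2 clone), first_fitting_plan([16, 32], 4)
rejects the 32-byte size and settles on 16 bytes with 2x V2DF and 1x V4SI,
while plan_by_vectorization_factor(4) yields one V4DF and one V4SI, matching
the two outcomes described in the comment.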