[Bug tree-optimization/85050] Vectorized function - suboptimal gather

jakub at gcc dot gnu.org Fri, 23 Mar 2018 08:54:12 -0700

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=85050


Jakub Jelinek <jakub at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2018-03-23
                 CC|                            |hjl.tools at gmail dot com,
                   |                            |itsimbal at gcc dot gnu.org,
                   |                            |jakub at gcc dot gnu.org,
                   |                            |jkoval at gcc dot gnu.org,
                   |                            |rguenth at gcc dot gnu.org
     Ever confirmed|0                           |1

--- Comment #1 from Jakub Jelinek <jakub at gcc dot gnu.org> ---
This is because the GCC vectorizer picks a vector size first and derives the
vectorization factor from it, rather than the other way around, and because we
represent declare simd functions internally as implicit loops and use the
vectorizer to vectorize them.
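
To make the implicit-loop view concrete, here is a minimal sketch in C.  The
function scale_entry and the helper scale_entry_lanes4 are hypothetical names
invented for illustration; they are not the test case from this bug, and the
loop is only a conceptual picture of what a 4-lane clone computes, not GCC's
actual internal representation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical declare simd function mixing a 32-bit index (int) with
   64-bit data (double); the indexed load is what becomes a gather.
   Without -fopenmp or -fopenmp-simd the pragma is simply ignored. */
#pragma omp declare simd uniform(table) notinbranch
double scale_entry(const double *table, int idx)
{
    return table[idx] * 2.0;
}

/* Conceptual view of a 4-lane clone: an implicit loop with exactly
   4 iterations, one per lane, which the vectorizer is then asked to
   vectorize like any other loop. */
void scale_entry_lanes4(const double *table, const int idx[4], double out[4])
{
    assert(table != NULL && idx != NULL && out != NULL);
    for (int lane = 0; lane < 4; lane++)
        out[lane] = scale_entry(table, idx[lane]);
}
```

The fixed trip count of 4 is the detail that matters for the rest of this
comment: a vectorization factor of 8 cannot be used on such a loop.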

For loops without mixed type sizes that makes no difference, but the loop in
question has 32-bit and 64-bit sizes (int and double).  With mixed type sizes
we use the smallest element type for the given vector size to determine the
vectorization factor, and for wider element types we use 2, 4, 8 etc. separate
registers.  So, e.g. for the _ZGVd* clone, which is AVX2, we start with vector
size 32, which would mean V4DF and V8SI modes, so we'd in that case vectorize
it as 2x V4DF registers and 1x V8SI.  But we give up on that because the implicit
loop doesn't have enough iterations (it has exactly 4, not 8+).  Thus we
vectorize with vector size of 16, with 2x V2DF registers and 1x V4SI.
ICC instead vectorizes with a vectorization factor of 4 and determines the
register sizes from that, so V4DF and V4SI.
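
The arithmetic behind the two strategies can be sketched as below.  The
function names are assumptions made up for this illustration and do not
correspond to anything in the GCC sources; the sketch only reproduces the
register counts quoted above for int (4 bytes) and double (8 bytes).

```c
#include <assert.h>

/* Vector size chosen first: the smallest element type fills exactly one
   register, which fixes the vectorization factor (lanes per iteration). */
int vf_from_vector_size(int vector_bytes, int smallest_elem_bytes)
{
    assert(vector_bytes > 0 && smallest_elem_bytes > 0);
    return vector_bytes / smallest_elem_bytes;
}

/* Number of vector_bytes-wide registers needed to hold vf elements of
   elem_bytes each (rounded up). */
int regs_needed(int vf, int elem_bytes, int vector_bytes)
{
    assert(vf > 0 && elem_bytes > 0 && vector_bytes > 0);
    return (vf * elem_bytes + vector_bytes - 1) / vector_bytes;
}

/* Vectorization factor chosen first: each element type gets its own
   register width, vf lanes wide. */
int bytes_from_vf(int vf, int elem_bytes)
{
    assert(vf > 0 && elem_bytes > 0);
    return vf * elem_bytes;
}
```

With vector size 32, vf_from_vector_size(32, 4) gives 8 lanes, needing
regs_needed(8, 8, 32) = 2 registers for double (2x V4DF) and 1 for int
(V8SI).  With vector size 16 the factor is 4, again with 2 registers for
double (2x V2DF) and 1 for int (V4SI).  Fixing the factor at 4 instead gives
bytes_from_vf(4, 8) = 32 for double (V4DF) and bytes_from_vf(4, 4) = 16 for
int (V4SI).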

This can be beneficial even outside of declare simd, at least for -O2
-ftree-vectorize vectorization, so we get smaller vectorized loops, if the
smaller vectorization factor fits one supported vector size and larger another
one.  I'm not really sure how well this would play with the
-mprefer-vector-width= option and attempts to avoid 512-bit vectors due to CPUs
lowering their frequencies.
