https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344

Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-10-12
          Component|target                      |middle-end
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the Newton-Raphson step causes register pressure to increase, and post-Haswell
this makes the code slower than not using rsqrt (thus using sqrtf and a
division)?
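
For reference, a minimal sketch of what the rsqrt replacement looks like,
assuming the usual one-step Newton-Raphson refinement on the hardware
estimate (the function name and intrinsic spelling are illustrative, not
what GCC actually emits):

  #include <immintrin.h>

  /* Hedged sketch: refine the ~12-bit rsqrtss estimate with one
     Newton-Raphson step, y' = y * (1.5 - 0.5 * x * y * y).  The two
     constants and the temporaries stay live across the computation,
     which is where the extra register pressure comes from.  */
  static inline float rsqrt_nr (float x)
  {
    __m128 vx = _mm_set_ss (x);
    __m128 y  = _mm_rsqrt_ss (vx);
    __m128 t  = _mm_mul_ss (_mm_mul_ss (_mm_set_ss (0.5f), vx),
                            _mm_mul_ss (y, y));
    y = _mm_mul_ss (y, _mm_sub_ss (_mm_set_ss (1.5f), t));
    return _mm_cvtss_f32 (y);
  }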

I wonder whether it would be profitable to SLP-vectorize this (of course
we're not considering it because SLP vectorization is looking for stores).
SLP vectorization would need to do 4 (or 8 with 256-bit AVX) vector inserts
and extracts, but it could then do the rsqrt and Newton-Raphson step together.
The argument computation feeding the sqrt is also loop-vectorizable, and the
ultimate operands even come from contiguous memory.  One of the tricky parts
would be seeing that only the first rsqrt argument is re-used, so taking
rinv21 to rinv33 (8 rsqrts) for the vectorization is probably best (see the
sketch after the snippet below).

          rinv11           = 1.0/sqrt(rsq11)
          rinv21           = 1.0/sqrt(rsq21)
          rinv31           = 1.0/sqrt(rsq31)
          rinv12           = 1.0/sqrt(rsq12)
          rinv22           = 1.0/sqrt(rsq22)
          rinv32           = 1.0/sqrt(rsq32)
          rinv13           = 1.0/sqrt(rsq13)
          rinv23           = 1.0/sqrt(rsq23)
          rinv33           = 1.0/sqrt(rsq33)
          r11              = rsq11*rinv11
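
A hedged sketch of the SLP idea above: pack four of the rsq values into one
vector, do a single vectorized rsqrt plus Newton-Raphson step, and extract
the lanes again.  The packing and unpacking are the vector inserts/extracts
mentioned above; all names here are made up for illustration:

  #include <immintrin.h>

  /* One vectorized rsqrt + Newton-Raphson step for four lanes at once.  */
  static inline __m128 rsqrt4_nr (__m128 x)
  {
    __m128 y = _mm_rsqrt_ps (x);
    __m128 t = _mm_mul_ps (_mm_mul_ps (_mm_set1_ps (0.5f), x),
                           _mm_mul_ps (y, y));
    return _mm_mul_ps (y, _mm_sub_ps (_mm_set1_ps (1.5f), t));
  }

  /* Four inserts to build the vector, one rsqrt/NR chain, four extracts.  */
  void rinv4 (float rsq11, float rsq21, float rsq31, float rsq12,
              float *rinv11, float *rinv21, float *rinv31, float *rinv12)
  {
    float out[4];
    __m128 x = _mm_set_ps (rsq12, rsq31, rsq21, rsq11);
    _mm_storeu_ps (out, rsqrt4_nr (x));
    *rinv11 = out[0]; *rinv21 = out[1]; *rinv31 = out[2]; *rinv12 = out[3];
  }

With 256-bit AVX the same scheme would handle eight of the nine rsqrts in
one go, which matches the rinv21..rinv33 grouping suggested above.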

What does ICC do to this loop?

I can confirm the regression on our tester (a Haswell machine btw).
