https://gcc.gnu.org/bugzilla/show_bug.cgi?id=82344
Richard Biener <rguenth at gcc dot gnu.org> changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
           Keywords|                            |ra
             Status|UNCONFIRMED                 |NEW
   Last reconfirmed|                            |2017-10-12
          Component|target                      |middle-end
     Ever confirmed|0                           |1

--- Comment #3 from Richard Biener <rguenth at gcc dot gnu.org> ---
So the Newton-Raphson step causes register pressure to increase, and
post-Haswell this makes the code slower than not using rsqrt (i.e. slower
than using sqrtf and a division)?

I wonder whether it would be profitable to SLP vectorize this (of course we
are not considering it at the moment because SLP vectorization looks for
stores).  SLP vectorization would need to do 4 (or 8 with AVX256) vector
inserts and extracts, but it could then do the rsqrt and the Newton-Raphson
step together.  The argument computation feeding the sqrt is also loop
vectorizable, and the ultimate operands even come from contiguous memory.
One of the tricky parts would be to see that only the first rsqrt argument
is re-used, so taking rinv21 to rinv33 (8 rsqrts) for the vectorization is
probably best:

  rinv11 = 1.0/sqrt(rsq11)
  rinv21 = 1.0/sqrt(rsq21)
  rinv31 = 1.0/sqrt(rsq31)
  rinv12 = 1.0/sqrt(rsq12)
  rinv22 = 1.0/sqrt(rsq22)
  rinv32 = 1.0/sqrt(rsq32)
  rinv13 = 1.0/sqrt(rsq13)
  rinv23 = 1.0/sqrt(rsq23)
  rinv33 = 1.0/sqrt(rsq33)
  r11    = rsq11*rinv11

What does ICC do to this loop?

I can confirm the regression on our tester (a Haswell machine, btw).
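
For illustration, a minimal sketch of what the rsqrt lowering of
1.0f/sqrtf(x) amounts to, written here with SSE intrinsics (hypothetical,
not the exact GIMPLE/RTL GCC emits): the estimate plus the Newton-Raphson
step keep two extra constants and an extra temporary live per result, which
is where the added register pressure comes from.

  #include <xmmintrin.h>

  /* Hypothetical sketch of the rsqrt + Newton-Raphson expansion of
     1.0f/sqrtf(x); not the exact sequence GCC generates.  */
  static inline float rinv_nr(float x)
  {
    /* ~12-bit reciprocal square root estimate */
    float y = _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
    /* One Newton-Raphson step: y' = y * (1.5 - 0.5*x*y*y), ~23-bit result.
       The 1.5f/0.5f constants and the x*y*y temporary all stay live,
       unlike a plain sqrtss + divss.  */
    return y * (1.5f - 0.5f * x * y * y);
  }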
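
And a rough sketch of the SLP idea, assuming AVX intrinsics and the variable
names from the snippet above (again hypothetical, only to illustrate the
packing cost versus doing one vector rsqrt plus one Newton-Raphson step for
all eight operands):

  #include <immintrin.h>

  /* Hypothetical illustration of the proposed SLP vectorization: pack the
     eight rsq21..rsq33 operands (rsq11 stays scalar since it is the one
     that is re-used), do a single vector rsqrt plus one Newton-Raphson
     refinement, then extract the lanes again.  The packing/unpacking is
     the "vector inserts and extracts" cost mentioned above.  */
  static inline __m256 rsqrt_nr_avx(__m256 x)
  {
    const __m256 half  = _mm256_set1_ps(0.5f);
    const __m256 three = _mm256_set1_ps(3.0f);
    __m256 y = _mm256_rsqrt_ps(x);                   /* ~12-bit estimates */
    /* y' = 0.5 * y * (3 - x*y*y): one Newton-Raphson step per lane */
    return _mm256_mul_ps(_mm256_mul_ps(half, y),
                         _mm256_sub_ps(three,
                           _mm256_mul_ps(x, _mm256_mul_ps(y, y))));
  }

  /* usage (lanes ordered low to high: rsq21, rsq31, rsq12, ..., rsq33):
     __m256 rinv = rsqrt_nr_avx(_mm256_set_ps(rsq33, rsq23, rsq13, rsq32,
                                              rsq22, rsq12, rsq31, rsq21)); */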