https://gcc.gnu.org/bugzilla/show_bug.cgi?id=88713
--- Comment #35 from Chris Elrod <elrodc at gmail dot com> ---
> rsqrt:
> .LFB12:
> .cfi_startproc
> vrsqrt28ps (%rsi), %zmm0
> vmovups %zmm0, (%rdi)
> vzeroupper
> ret
>
> (huh? isn't there a NR step missing?)
>
I assume that because vrsqrt28ps is much more accurate than vrsqrt14ps, the NR
step wasn't considered necessary. Unfortunately, -march=skylake-avx512 does not
enable -mavx512er, so it should use the less accurate vrsqrt14ps + NR step.
I think vrsqrt14pd/ps only need -mavx512f or -mavx512vl.
> Without -mavx512er, we do not have an expander for rsqrtv16sf2, and without
> that I don't know how the machinery can guess how to use rsqrt (there are
> probably ways).
Looking at the asm from only r[i] = sqrtf(a[i]):
vmovups (%rsi), %zmm1
vxorps %xmm0, %xmm0, %xmm0
vcmpps $4, %zmm1, %zmm0, %k1
vrsqrt14ps %zmm1, %zmm0{%k1}{z}
vmulps %zmm1, %zmm0, %zmm1
vmulps %zmm0, %zmm1, %zmm0
vmulps .LC1(%rip), %zmm1, %zmm1
vaddps .LC0(%rip), %zmm0, %zmm0
vmulps %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)
vs. the asm from only r[i] = 1 / a[i]:
vmovups (%rsi), %zmm1
vrcp14ps %zmm1, %zmm0
vmulps %zmm1, %zmm0, %zmm1
vmulps %zmm1, %zmm0, %zmm1
vaddps %zmm0, %zmm0, %zmm0
vsubps %zmm1, %zmm0, %zmm0
vmovups %zmm0, (%rdi)
it looks like the expanders are there for sqrt and for the reciprocal, and
we're just getting both, one after the other. So I could benchmark whether
either one is slower than the regular instruction on my platform, if that
would be useful.