andrew.w.kaylor added a comment.
This example illustrates the problem this patch intends to fix:
https://godbolt.org/z/j445sxPMc
For Intel microarchitectures before Skylake, the LLVM cost model treats
vector fsqrt as slow, so with fast-math enabled we use a reciprocal square
root approximation rather than the vsqrtps instruction when vectorizing a
call to sqrtf(). If the code is compiled with -march=skylake or
-mtune=skylake, we choose the vsqrtps instruction, but with any earlier base
target we choose the approximation, even inside a cpu_specific(skylake)
implementation in the source code.
For example, with fast-math enabled and a pre-Skylake base target,
  __attribute__((cpu_specific(skylake))) void foo(void) {
    for (int i = 0; i < 8; ++i)
      x[i] = sqrtf(y[i]);
  }
compiles to
  foo.b:
    vmovaps ymm0, ymmword ptr [rip + y]
    vrsqrtps ymm1, ymm0
    vmulps ymm2, ymm0, ymm1
    vbroadcastss ymm3, dword ptr [rip + .LCPI2_0] # ymm3 = [-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0]
    vfmadd231ps ymm3, ymm2, ymm1 # ymm3 = (ymm2 * ymm1) + ymm3
    vbroadcastss ymm1, dword ptr [rip + .LCPI2_1] # ymm1 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
    vmulps ymm1, ymm2, ymm1
    vmulps ymm1, ymm1, ymm3
    vbroadcastss ymm2, dword ptr [rip + .LCPI2_2] # ymm2 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
    vandps ymm0, ymm0, ymm2
    vbroadcastss ymm2, dword ptr [rip + .LCPI2_3] # ymm2 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
    vcmpleps ymm0, ymm2, ymm0
    vandps ymm0, ymm0, ymm1
    vmovaps ymmword ptr [rip + x], ymm0
    vzeroupper
    ret
but it should compile to
  foo.b:
    vsqrtps ymm0, ymmword ptr [rip + y]
    vmovaps ymmword ptr [rip + x], ymm0
    vzeroupper
    ret
CHANGES SINCE LAST ACTION
https://reviews.llvm.org/D121410/new/
https://reviews.llvm.org/D121410