andrew.w.kaylor added a comment.

This example illustrates the problem this patch intends to fix: https://godbolt.org/z/j445sxPMc

For Intel microarchitectures before Skylake, the LLVM cost model says that vector fsqrt is slow, so if fast-math is enabled, we'll use an approximation rather than the vsqrtps instruction when vectorizing a call to sqrtf(). If the code is compiled with -march=skylake or -mtune=skylake, we'll choose the vsqrtps instruction, but with any earlier base target, we'll choose the approximation even if there is a cpu_specific(skylake) implementation in the source code.

For example

  __attribute__((cpu_specific(skylake)))
  void foo(void) {
    for (int i = 0; i < 8; ++i)
      x[i] = sqrtf(y[i]);
  }

compiles to

  foo.b:
          vmovaps ymm0, ymmword ptr [rip + y]
          vrsqrtps        ymm1, ymm0
          vmulps  ymm2, ymm0, ymm1
          vbroadcastss    ymm3, dword ptr [rip + .LCPI2_0] # ymm3 = [-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0,-3.0E+0]
          vfmadd231ps     ymm3, ymm2, ymm1 # ymm3 = (ymm2 * ymm1) + ymm3
          vbroadcastss    ymm1, dword ptr [rip + .LCPI2_1] # ymm1 = [-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1,-5.0E-1]
          vmulps  ymm1, ymm2, ymm1
          vmulps  ymm1, ymm1, ymm3
          vbroadcastss    ymm2, dword ptr [rip + .LCPI2_2] # ymm2 = [NaN,NaN,NaN,NaN,NaN,NaN,NaN,NaN]
          vandps  ymm0, ymm0, ymm2
          vbroadcastss    ymm2, dword ptr [rip + .LCPI2_3] # ymm2 = [1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38,1.17549435E-38]
          vcmpleps        ymm0, ymm2, ymm0
          vandps  ymm0, ymm0, ymm1
          vmovaps ymmword ptr [rip + x], ymm0
          vzeroupper
          ret

but it should compile to

  foo.b:
          vsqrtps ymm0, ymmword ptr [rip + y]
          vmovaps ymmword ptr [rip + x], ymm0
          vzeroupper
          ret

CHANGES SINCE LAST ACTION
  https://reviews.llvm.org/D121410/new/

https://reviews.llvm.org/D121410

_______________________________________________
cfe-commits mailing list
cfe-commits@lists.llvm.org
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-commits
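
For readers who want to follow what the longer sequence above computes, here is a rough scalar model of a single lane, written as a sketch rather than as what LLVM actually emits. The names approx_sqrt_lane and approx_sqrt_selfcheck are hypothetical, and the reciprocal-square-root estimate that vrsqrtps would produce is passed in as a parameter (est) so the code does not depend on any particular intrinsic. The arithmetic follows my reading of the listing: one Newton-Raphson refinement step, followed by a mask that forces the result to zero for inputs whose magnitude is below FLT_MIN.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Hypothetical scalar model of one lane of the approximation sequence.
 * x   : the input value (one element of y[])
 * est : a rough estimate of 1/sqrt(x), standing in for the vrsqrtps result */
float approx_sqrt_lane(float x, float est) {
    float t = x * est;            /* vmulps: x * est, a rough sqrt(x)        */
    float corr = t * est - 3.0f;  /* vfmadd231ps: (x * est) * est + (-3.0)   */
    float half = t * -0.5f;       /* vmulps with the -0.5 constant           */
    float r = half * corr;        /* refined value: 0.5 * t * (3 - x*est^2)  */

    /* vandps / vcmpleps / vandps: keep r only when |x| >= FLT_MIN.
     * For x == 0 the estimate is infinite and r is NaN, so the mask
     * is what turns that lane into 0. */
    float ax = x < 0.0f ? -x : x;
    return (ax >= FLT_MIN) ? r : 0.0f;
}

/* Small self-check with hand-picked estimates (assumed values, not
 * actual vrsqrtps output). */
void approx_sqrt_selfcheck(void) {
    /* 1/sqrt(4) is 0.5; 0.5005f models an estimate that is off by 0.1%. */
    float r = approx_sqrt_lane(4.0f, 0.5005f);
    assert(r > 1.9999f && r < 2.0001f);

    /* Zero input: the estimate is +inf, and the mask yields 0. */
    assert(approx_sqrt_lane(0.0f, INFINITY) == 0.0f);
}
```

The refinement step is why a coarse estimate is enough: if the estimate has a relative error of e, the refined result has a relative error of about 1.5 * e^2. That is still roughly a dozen instructions per vector against a single vsqrtps, which is the trade-off the cost model is making for the pre-Skylake targets.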