On 10/01/2019 08:19, Richard Biener wrote:
> On Wed, 9 Jan 2019, Jakub Jelinek wrote:
>
>> On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
>>> extern void vf1()
>>> {
>>>    #pragma vectorize enable
>>>    for ( int i = 0 ; i < 32768 ; i++ )
>>>      data [ i ] = std::sqrt ( data [ i ] ) ;
>>> }
>>>
>>> Compiling on this x86_64 box with -fopt-info-vec-missed shows the
>>
>>>   _7 = .SQRT (_1);
>>>   if (_1 u>= 0.0)
>>>     goto <bb 8>; [99.95%]
>>>   else
>>>     goto <bb 4>; [0.05%]
>>>
>>>   <bb 8> [local count: 1062472912]:
>>>   goto <bb 5>; [100.00%]
>>>
>>>   <bb 4> [local count: 531495]:
>>>   __builtin_sqrtf (_1);
>>>
>>> I'm not sure where that control flow came from: it isn't in
>>>   sqrt-test.cc.104t.stdarg
>>> but is in
>>>   sqrt-test.cc.105t.cdce
>>> so I think it's coming from the argument-range code in cdce.
>>>
>>> Arguably the location on the statement is wrong: it's on the loop
>>> header, when it presumably should be on the std::sqrt call.
>>
>> See my either mail, it is the result of the -fmath-errno default,
>> the inline emitted sqrt doesn't handle errno setting and we emit
>> essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg);
>> where
>> the former sqrt is inline using HW instructions and the latter is the
>> library call.
>>
>> With some extra work we could vectorize it; e.g. if we make it handle
>> OpenMP #pragma omp ordered simd efficiently, it would be the same thing
>> - allow non-vectorizable portions of vectorized loops by doing there a
>> scalar loop from 0 to vf-1 doing the non-vectorizable stuff + drop the
>> limitation
>> that the vectorized loop is a single bb.  Essentially, in this case it would
>> be
>>   vec1 = vec_load (data + i);
>>   vec2 = vec_sqrt (vec1);
>>   if (__builtin_expect (any (vec2 < 0.0)))
>>     {
>>       for (int i = 0; i < vf; i++)
>>         sqrt (vec2[i]);
>>     }
>>   vec_store (data + i, vec2);
>> If that would turn to be way too hard, we could for the vectorization
>> purposes hide that into the .SQRT internal fn, say add a fndecl argument to
>> it if it should treat the exceptional cases some way so that the control
>> flow isn't visible in the vectorized loop.
>
> If we decide it's worth the trouble I'd rather do that in the epilogue
> and thus make the any (vec2 < 0.0) a reduction.  Like
>
>    smallest = min(smallest, vec1);
>
> and after the loop do the errno thing on the smallest element.
>
> That said, this is a transform that is probably worthwhile even
> on scalar code, possibly easiest to code-gen right from the start
> in the call-dce pass.

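
For reference, the epilogue/reduction idea quoted above might look roughly
like this when written out by hand at the source level. This is only a
sketch under assumptions, not what GCC generates: hw_sqrt and
sqrt_all_deferred_errno are hypothetical names, and hw_sqrt is a portable
stand-in for the inline hardware instruction (it never reaches the errno
path of the library).

```cpp
#include <algorithm>
#include <cerrno>
#include <cmath>
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the inline hardware sqrt: it never touches errno,
// because std::sqrt has no domain error for non-negative or NaN arguments.
static inline float hw_sqrt(float x)
{
    return x < 0.0f ? std::numeric_limits<float>::quiet_NaN() : std::sqrt(x);
}

// Hypothetical helper, not GCC code: take square roots in place and defer the
// errno side effect to a single library call after the loop.
void sqrt_all_deferred_errno(float *data, std::size_t n)
{
    float smallest = std::numeric_limits<float>::infinity();

    for (std::size_t i = 0; i < n; i++) {
        smallest = std::min(smallest, data[i]);  // min reduction over the inputs
        data[i] = hw_sqrt(data[i]);              // no library call in the loop body
    }

    // Epilogue: some input was negative exactly when the minimum is negative,
    // so one library call on it is enough to get errno set.
    if (smallest < 0.0f) {
        volatile float sink = std::sqrt(smallest);
        (void)sink;
    }
}
```

The loop body then has no call and no extra control flow; whether errno
actually ends up as EDOM still depends on the libm in use
(math_errhandling & MATH_ERRNO).
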
If this is useful for anything other than errno handling then fine, but
I think it's a really bad idea to add optimization complexity because of
errno handling: nobody checks errno after sqrt (other than conformance
test code).

-fno-math-errno is almost surely closer to what the user wants than
trying to vectorize the errno handling.
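
To illustrate why errno is rarely needed here: code that does care about
the domain error can look at the result instead, which behaves the same
with or without -fno-math-errno. A minimal sketch, assuming IEEE 754
semantics (so no -ffinite-math-only); checked_sqrt is a hypothetical
helper, not an existing API.

```cpp
#include <cmath>

// Hypothetical helper: report a sqrt domain error through the return value
// rather than errno, so the caller does not depend on -fmath-errno.
bool checked_sqrt(float x, float &out)
{
    out = std::sqrt(x);
    // Under IEEE 754 the square root of a negative number is NaN. A NaN
    // input is passed through and is not counted as a domain error.
    return !(std::isnan(out) && !std::isnan(x));
}
```

A caller would write something like "if (!checked_sqrt(v, r)) ..." and
never read errno at all.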