Re: autovectorization in gcc

Szabolcs Nagy Thu, 10 Jan 2019 03:12:27 -0800

On 10/01/2019 08:19, Richard Biener wrote:
> On Wed, 9 Jan 2019, Jakub Jelinek wrote:
> 
>> On Wed, Jan 09, 2019 at 11:10:25AM -0500, David Malcolm wrote:
>>> extern void vf1()
>>> {
>>>    #pragma vectorize enable
>>>    for ( int i = 0 ; i < 32768 ; i++ )
>>>      data [ i ] = std::sqrt ( data [ i ] ) ;
>>> }
>>>
>>> Compiling on this x86_64 box with -fopt-info-vec-missed shows the
>>
>>>   _7 = .SQRT (_1);
>>>   if (_1 u>= 0.0)
>>>     goto <bb 8>; [99.95%]
>>>   else
>>>     goto <bb 4>; [0.05%]
>>>
>>>   <bb 8> [local count: 1062472912]:
>>>   goto <bb 5>; [100.00%]
>>>
>>>   <bb 4> [local count: 531495]:
>>>   __builtin_sqrtf (_1);
>>>
>>> I'm not sure where that control flow came from: it isn't in
>>>   sqrt-test.cc.104t.stdarg
>>> but is in
>>>   sqrt-test.cc.105t.cdce
>>> so I think it's coming from the argument-range code in cdce.
>>>
>>> Arguably the location on the statement is wrong: it's on the loop
>>> header, when it presumably should be on the std::sqrt call.
>>
>> See my other mail, it is the result of the -fmath-errno default,
>> the inline emitted sqrt doesn't handle errno setting and we emit
>> essentially x = sqrt (arg); if (__builtin_expect (arg < 0.0, 0)) sqrt (arg);
>> where the former sqrt is inline using HW instructions and the latter is the
>> library call.
>>
>> With some extra work we could vectorize it; e.g. if we make it handle
>> OpenMP #pragma omp ordered simd efficiently, it would be the same thing
>> - allow non-vectorizable portions of vectorized loops by doing there a
>> scalar loop from 0 to vf-1 doing the non-vectorizable stuff + drop the
>> limitation that the vectorized loop is a single bb.  Essentially, in this
>> case it would be
>>   vec1 = vec_load (data + i);
>>   vec2 = vec_sqrt (vec1);
>>   if (__builtin_expect (any (vec2 < 0.0)))
>>     {
>>       for (int i = 0; i < vf; i++)
>>         sqrt (vec2[i]);
>>     }
>>   vec_store (data + i, vec2);
>> If that would turn out to be way too hard, we could for the vectorization
>> purposes hide that into the .SQRT internal fn, say add a fndecl argument to
>> it if it should treat the exceptional cases some way so that the control
>> flow isn't visible in the vectorized loop.
> 
> If we decide it's worth the trouble I'd rather do that in the epilogue
> and thus make the any (vec2 < 0.0) a reduction.  Like
> 
>    smallest = min(smallest, vec1);
> 
> and after the loop do the errno thing on the smallest element.
> 
> That said, this is a transform that is probably worthwhile even
> on scalar code, possibly easiest to code-gen right from the start
> in the call-dce pass.


if this is useful other than errno handling then fine,
but i think it's a really bad idea to add optimization
complexity because of errno handling: nobody checks
errno after sqrt (other than conformance test code).

-fno-math-errno is almost surely closer to what the user
wants than trying to vectorize the errno handling.
